This paper describes a weakly supervised end-to-end model for estimating 3D human pose from a single image. The model is trained by reprojecting the predicted 3D poses to 2D and matching them against ground-truth 2D poses for supervision, so only minimal 3D labels are needed. A mathematical camera model with intrinsic and extrinsic parameters enables accurate reprojection; the EPnP algorithm is used to estimate a precise reprojection, and an uncertainty-aware PnP algorithm further improves its accuracy by taking the uncertainty of the joint estimates into account. In addition, a generative adversarial network with a Transformer-based encoder as its generator predicts the 3D pose: the self-attention mechanism establishes dependencies between joints, and features from an edge detection module and a 2D pose estimation module are fused to provide structural constraints and spatial information. Thanks to this efficient reprojection scheme, the model achieves competitive results among weakly supervised methods on Human3.6M and MPI-INF-3DHP, with improvements of about 2.5% and 2.45%, respectively.
{"title":"Weakly supervised 3D human pose estimation based on PnP projection model","authors":"Xiaoyan Zhang , Yunlai Chen , Huaijing Lai , Hongzheng Zhang","doi":"10.1016/j.patcog.2025.111464","DOIUrl":"10.1016/j.patcog.2025.111464","url":null,"abstract":"<div><div>This paper describes a weakly supervised end-to-end model for estimating 3D human pose from a single image. The model is trained by reprojecting 3D poses to 2D poses for matching ground truth 2D poses for supervision, with minimal need for 3D labels. A mathematical camera model, utilizing intrinsic and extrinsic parameters, enables accurate reprojection and we use EPnP algorithm to estimate precise reprojection. While the uncertainty-aware PnP algorithm further improves the accuracy of estimated reprojection by considering the uncertainty of joint estimation. Further, an adversarial generative network, employing a Transformer-based encoder as generator, is used to predict 3D pose, which utilizes self-attention mechanism to establish dependencies between joints, and fuses features from an edge detection module and a 2D pose estimation module for constraint and spatial information. The model’s efficient reprojection method enables competitive results on Human3.6M and MPI-INF-3DHP, among weakly supervised methods, about 2.5% and 2.45% improvement respectively.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111464"},"PeriodicalIF":7.5,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143464529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-18 | DOI: 10.1016/j.patcog.2025.111466
Xiaoxu Li, Shuo Ding, Jiyang Xie, Xiaochen Yang, Zhanyu Ma, Jing-Hao Xue
Fine-grained few-shot classification is a challenging task in computer vision that aims to classify images with subtle, detailed differences given scarce labeled samples. A promising avenue for tackling this challenge is to use spatially local features to densely measure the similarity between query and support samples. Compared with image-level global features, local features contain more low-level information that is rich and transferable across categories. However, methods based on spatially local features have difficulty distinguishing subtle category differences due to the lack of sample diversity. To address this issue, we propose a novel method called the Cross-view Deep Nearest Neighbor Neural Network (CDN4). CDN4 applies a random geometric transformation to augment a different view of the support and query samples and then exploits four similarities between the original and transformed views of the query local features and those of the support local features. The geometric augmentation increases the diversity between samples of the same class, and the cross-view measurement encourages the model to focus on discriminative local features for classification through the cross-measurements between the two branches. Extensive experiments validate the superiority of CDN4, which achieves new state-of-the-art results in few-shot classification across various fine-grained benchmarks. Code is available at .
{"title":"CDN4: A cross-view Deep Nearest Neighbor Neural Network for fine-grained few-shot classification","authors":"Xiaoxu Li , Shuo Ding , Jiyang Xie , Xiaochen Yang , Zhanyu Ma , Jing-Hao Xue","doi":"10.1016/j.patcog.2025.111466","DOIUrl":"10.1016/j.patcog.2025.111466","url":null,"abstract":"<div><div>The fine-grained few-shot classification is a challenging task in computer vision, aiming to classify images with subtle and detailed differences given scarce labeled samples. A promising avenue to tackle this challenge is to use spatially local features to densely measure the similarity between query and support samples. Compared with image-level global features, local features contain more low-level information that is rich and transferable across categories. However, methods based on spatially localized features have difficulty distinguishing subtle category differences due to the lack of sample diversity. To address this issue, we propose a novel method called Cross-view Deep Nearest Neighbor Neural Network (CDN4). CDN4 applies a random geometric transformation to augment a different view of support and query samples and subsequently exploits four similarities between the original and transformed views of query local features and those views of support local features. The geometric augmentation increases the diversity between samples of the same class, and the cross-view measurement encourages the model to focus more on discriminative local features for classification through the cross-measurements between the two branches. Extensive experiments validate the superiority of CDN4, which achieves new state-of-the-art results in few-shot classification across various fine-grained benchmarks. Code is available at .</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111466"},"PeriodicalIF":7.5,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143464527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-18 | DOI: 10.1016/j.patcog.2025.111459
Chaithra Umesh, Kristian Schultz, Manjunath Mahendra, Saptarshi Bej, Olaf Wolkenhauer
Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data has yet to be explored. In this article, in addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among attributes. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Utilizing this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when generating synthetic datasets. In addition, we show that some tabular synthetic data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state of the art reveal research needs and opportunities for developing task-specific synthetic tabular data generation models.
{"title":"Preserving logical and functional dependencies in synthetic tabular data","authors":"Chaithra Umesh , Kristian Schultz , Manjunath Mahendra , Saptarshi Bej , Olaf Wolkenhauer","doi":"10.1016/j.patcog.2025.111459","DOIUrl":"10.1016/j.patcog.2025.111459","url":null,"abstract":"<div><div>Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies while generating synthetic data is yet to be explored. In addition to the existing notion of functional dependencies, we introduce the notion of logical dependencies among the attributes in this article. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Utilizing this measure, we compare several state-of-the-art synthetic data generation algorithms and test their capability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies when they generate synthetic datasets. In addition, we also showed that some tabular synthetic data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state-of-the-art reveal research needs and opportunities to develop task-specific synthetic tabular data generation models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111459"},"PeriodicalIF":7.5,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-18 | DOI: 10.1016/j.patcog.2025.111465
Liyuan Chen, Dawei Zhang, Xiao Wang, Chang Wan, Shan Jin, Zhonglong Zheng
Leveraging scribble annotations for weakly supervised salient object detection significantly reduces the reliance on extensive, precise labels during model training. To make the best use of these sparse annotations, we introduce a novel framework called the Complementary Reliable Region Aggregation Network (CRANet). CRANet adopts a dual-model design that integrates complementary information from two models with the same architecture but different parameters: a foreground model that generates the saliency map and a background model that identifies the regions excluding salient objects. By merging the outputs of both models, we propose a reliable pseudo-label aggregation strategy that expands the supervision provided by scribble annotations and eliminates the need for predefined thresholds and other parameterized modules. High-confidence predictions are then combined into pseudo labels that guide the training of both models. Additionally, we incorporate a flipping consistency method and a flipped guided loss function to enhance prediction consistency and enlarge the training set, effectively addressing the challenges posed by sparse and structurally constrained scribble annotations. Experimental results demonstrate that our approach significantly outperforms existing methods.
{"title":"A complementary dual model for weakly supervised salient object detection","authors":"Liyuan Chen , Dawei Zhang , Xiao Wang , Chang Wan , Shan Jin , Zhonglong Zheng","doi":"10.1016/j.patcog.2025.111465","DOIUrl":"10.1016/j.patcog.2025.111465","url":null,"abstract":"<div><div>Leveraging scribble annotations for weakly supervised salient object detection significantly reduces reliance on extensive, precise labels during model training. To optimize the use of these sparse annotations, we introduce a novel framework called the Complementary Reliable Region Aggregation Network (CRANet). This framework utilizes a dual-model framework that integrates complementary information from two models with the same architecture but different parameters: a foreground model that generates the saliency map and a background model that identifies regions excluding salient objects. By merging the outputs of both models, we propose a reliable pseudo-label aggregation strategy that expands the supervision capability of scribble annotations, eliminating the necessity for predefined thresholds and other parameterized modules. High-confidence predictions are then combined to create pseudo labels that guide the training process of both models. Additionally, we incorporate a flipping consistency method and a flipped guided loss function to enhance prediction consistency and increase the scale of the training set, effectively addressing the challenges posed by sparse and structurally constrained scribble annotations. Experimental results demonstrate that our approach significantly outperforms existing methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111465"},"PeriodicalIF":7.5,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-17 | DOI: 10.1016/j.patcog.2025.111467
Xu Tan, Jiawei Yang, Junqi Chen, Sylwan Rahardja, Susanto Rahardja
The Autoencoder (AE) is popular in Outlier Detection (OD) due to its strong modeling ability. However, AE-based OD methods suffer from the unexpected reconstruction problem: outliers are reconstructed with low errors, impeding their distinction from inliers. This stems from two aspects. First, an AE trained with the mean squared error may overconfidently produce good reconstructions in regions where outliers or potential outliers exist. To address this, aleatoric uncertainty is introduced to construct the Probabilistic Autoencoder (PAE), and the Weighted Negative Log-Likelihood (WNLL) is proposed to enlarge the score disparity between inliers and outliers. Second, the AE focuses on global modeling and lacks the perception of local information. Therefore, the Mean-Shift Scoring (MSS) method is proposed to exploit the local relationships in the data and reduce the false inliers caused by the AE. Experiments on 32 real-world OD datasets prove the effectiveness of the proposed methods: the combination of WNLL and MSS achieves a 45% relative performance improvement over the best baseline, and MSS improves the detection performance of multiple AE-based outlier detectors by an average of 20%. The proposed methods have the potential to advance the development of AEs in OD.
{"title":"MSS-PAE: Saving Autoencoder-based Outlier Detection from Unexpected Reconstruction","authors":"Xu Tan , Jiawei Yang , Junqi Chen , Sylwan Rahardja , Susanto Rahardja","doi":"10.1016/j.patcog.2025.111467","DOIUrl":"10.1016/j.patcog.2025.111467","url":null,"abstract":"<div><div>The Autoencoder (AE) is popular in Outlier Detection (OD) now due to its strong modeling ability. However, AE-based OD methods face the unexpected reconstruction problem: outliers are reconstructed with low errors, impeding their distinction from inliers. This stems from two aspects. First, AE may overconfidently produce good reconstructions in regions where outliers or potential outliers exist while using the mean squared error. To address this, the aleatoric uncertainty was introduced to construct the Probabilistic Autoencoder (PAE), and the Weighted Negative Log-Likelihood (WNLL) was proposed to enlarge the score disparity between inliers and outliers. Second, AE focuses on global modeling yet lacks the perception of local information. Therefore, the Mean-Shift Scoring (MSS) method was proposed to utilize the local relationship of data to reduce the false inliers caused by AE. Moreover, experiments on 32 real-world OD datasets proved the effectiveness of the proposed methods. The combination of WNLL and MSS achieved 45% relative performance improvement compared to the best baseline. In addition, MSS improved the detection performance of multiple AE-based outlier detectors by an average of 20%. The proposed methods have the potential to advance AE’s development in OD.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111467"},"PeriodicalIF":7.5,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-17 | DOI: 10.1016/j.patcog.2025.111457
Rui Ming, Yixian Xiao, Xinyu Liu, Guolong Zheng, Guobao Xiao
Visible and infrared image fusion aims to generate fused images with comprehensive scene understanding and detailed contextual information. However, existing methods often struggle to adequately model the relationships between the two modalities and to optimize for downstream applications. To address these challenges, we propose a novel scene-semantic decomposition-based approach for visible and infrared image fusion, termed SSDFusion. Our method employs a multi-level encoder-fusion network whose fusion modules implement the proposed scene-semantic decomposition and fusion strategy: they extract and fuse scene-related and semantic-related components, respectively, and inject the fused semantics into the scene features, enriching the contextual information of the fused features while preserving the fidelity of the fused images. Moreover, we incorporate meta-feature embedding to connect the encoder-fusion network with the downstream application network during training, enhancing our method's ability to extract semantics, optimize the fusion effect, and serve tasks such as semantic segmentation. Extensive experiments demonstrate that SSDFusion achieves state-of-the-art image fusion performance while also improving results on semantic segmentation tasks. Our approach bridges the gap between feature decomposition-based image fusion and high-level vision applications, providing a more effective paradigm for multi-modal image fusion. The code is available at https://github.com/YiXian-Xiao/SSDFusion.
{"title":"SSDFusion: A scene-semantic decomposition approach for visible and infrared image fusion","authors":"Rui Ming , Yixian Xiao , Xinyu Liu , Guolong Zheng , Guobao Xiao","doi":"10.1016/j.patcog.2025.111457","DOIUrl":"10.1016/j.patcog.2025.111457","url":null,"abstract":"<div><div>Visible and infrared image fusion aims to generate fused images with comprehensive scene understanding and detailed contextual information. However, existing methods often struggle to adequately handle relationships between different modalities and optimize for downstream applications. To address these challenges, we propose a novel scene-semantic decomposition-based approach for visible and infrared image fusion, termed <em>SSDFusion</em>. Our method employs a multi-level encoder-fusion network with fusion modules implementing the proposed scene-semantic decomposition and fusion strategy to extract and fuse scene-related and semantic-related components, respectively, and inject the fused semantics into scene features, enriching the contextual information in fused features while sustaining fidelity of fused images. Moreover, we further incorporate meta-feature embedding to connect the encoder-fusion network with the downstream application network during the training process, enhancing our method’s ability to extract semantics, optimize the fusion effect, and serve tasks such as semantic segmentation. Extensive experiments demonstrate that SSDFusion achieves state-of-the-art image fusion performance while enhancing results on semantic segmentation tasks. Our approach bridges the gap between feature decomposition-based image fusion and high-level vision applications, providing a more effective paradigm for multi-modal image fusion. The code is available at https://github.com/YiXian-Xiao/SSDFusion.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111457"},"PeriodicalIF":7.5,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-17 | DOI: 10.1016/j.patcog.2025.111462
Wuzhen Shi, Aixue Yin, Yingxiang Li, Yang Wen
Considering that edge maps reflect the structure of objects well, we propose to train an end-to-end network that reconstructs 3D models from edge maps. Since edge maps can easily be extracted from both RGB images and sketches, our edge-based 3D reconstruction network (EBNet) can reconstruct 3D models from either input. To exploit both the texture and the edge information of an image, we further propose an edge-guided 3D reconstruction network (EGNet), which uses edge information to enhance the perception of structures and thereby improves the quality of the reconstructed 3D model. Although sketches contain less texture information than RGB images, experiments show that EGNet also helps when reconstructing 3D models from sketches. To exploit the complementary information among different viewpoints, we further propose a multi-view edge-guided 3D reconstruction network (MEGNet) with a structure-aware fusion module. To the best of our knowledge, we are the first to use edge maps to enhance structural information for multi-view 3D reconstruction. Experimental results on the ShapeNet and Synthetic-LineDrawing benchmarks show that the proposed method outperforms state-of-the-art methods in reconstructing 3D models from both RGB images and sketches. Ablation studies demonstrate the effectiveness of the proposed modules.
{"title":"Edge-guided 3D reconstruction from multi-view sketches and RGB images","authors":"Wuzhen Shi, Aixue Yin, Yingxiang Li, Yang Wen","doi":"10.1016/j.patcog.2025.111462","DOIUrl":"10.1016/j.patcog.2025.111462","url":null,"abstract":"<div><div>Considering that edge maps can well reflect the structure of objects, we novelly propose to train an end-to-end network to reconstruct 3D models from edge maps. Since edge maps can be easily extracted from RGB images and sketches, our edge-based 3D reconstruction network (EBNet) can be used to reconstruct 3D models from both RGB images and sketches. In order to utilize both the texture and edge information of the image to obtain better 3D reconstruction results, we further propose an edge-guided 3D reconstruction network (EGNet), which enhances the perception of structures by edge information to improve the performance of the reconstructed 3D model. Although sketches have less texture information compared to RGB images, experiments show that our EGNet can also help improve the performance of reconstructing 3D models from sketches. To exploit the complementary information among different viewpoints, we further propose a multi-view edge-guided 3D reconstruction network (MEGNet) with a structure-aware fusion module. To the best of our knowledge, we are the first to use edge maps to enhance structural information for multi-view 3D reconstruction. Experimental results on the ShapeNet, Synthetic-LineDrawing benchmarks show that the proposed method outperforms the state-of-the-art methods for reconstructing 3D models from both RGB images and sketches. Ablation studies demonstrate the effectiveness of the proposed different modules.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111462"},"PeriodicalIF":7.5,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143464531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-17 | DOI: 10.1016/j.patcog.2025.111476
Quan Lu, Hongbin Zhang, Linfei Yin
To address the information loss and edge blurring caused by the loss of gradient features during the fusion of infrared and visible images, this study proposes a dual-encoder image fusion method (DEFusion) based on dense connectivity. The proposed method processes infrared and visible images by different means, thereby preserving the features of the original images as well as possible. A new progressive fusion strategy ensures that the network captures the detailed information present in visible images while minimizing the gradient loss of the infrared image. Furthermore, a novel loss function comprising a gradient loss and a content loss is proposed to guide the fusion process, ensuring that the fusion results account for both the detailed information and the gradients of the source images. Experimental comparisons with state-of-the-art methods on the TNO and RoadScene datasets verify that the proposed method achieves superior performance on most indices. The fused images exhibit excellent subjective contrast and clarity, providing a strong visual impression, and the comparison experiments demonstrate favorable generalization and robustness.
{"title":"Infrared and visible image fusion via dual encoder based on dense connection","authors":"Quan Lu, Hongbin Zhang, Linfei Yin","doi":"10.1016/j.patcog.2025.111476","DOIUrl":"10.1016/j.patcog.2025.111476","url":null,"abstract":"<div><div>Aiming at the problems of information loss and edge blurring due to the loss of gradient features that tend to occur during the fusion of infrared and visible images, this study proposes a dual encoder image fusion method (DEFusion) based on dense connectivity. The proposed method processes infrared and visible images by different means, therefore guaranteeing the best possible preservation of the features of the original image. A new progressive fusion strategy is constructed to ensure that the network is better able to capture the detailed information present in visible images while minimizing the gradient loss of the infrared image. Furthermore, a novel loss function that includes gradient loss and content loss, which ensures that the fusion results consider both the detailed information and gradient of the source image, is proposed in this study to facilitate the fusion process. The experimental results with the state-of-art methods on TNO and RoadScene datasets verify that the proposed method exhibits superior performance in most indices. The fused image exhibits excellent subjective contrast and clarity, providing a strong visual perception. The results of the comparison experiment demonstrate that this method exhibits favorable characteristics in terms of generalization and robustness.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111476"},"PeriodicalIF":7.5,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143454213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-16 | DOI: 10.1016/j.patcog.2025.111463
Xu Gong, Maotao Liu, Qun Liu, Yike Guo, Guoyin Wang
Molecular property prediction is a critical task with substantial applications in drug design and repositioning. The multiplicity of molecular data modalities and the paucity of labeled data present significant challenges that affect algorithmic performance in this domain. However, conventional approaches typically focus on a single data modality and ignore either hierarchical structural features or other data pattern information, which limits their ability to express complex phenomena and relationships. Additionally, the scarcity of labeled data obstructs the accurate mapping of instances to labels in property prediction tasks. To address these issues, we propose the Multimodal Data Fusion-based graph Contrastive Learning framework (MDFCL) for molecular property prediction. Specifically, we incorporate exhaustive information from two molecular data modalities, namely graph and sequence structures. Subsequently, adaptive data augmentation strategies are designed for the multimodal data based on molecular backbones and side chains. Built upon these augmentation strategies, we develop a graph contrastive learning framework and pre-train it with unlabeled data (~10M molecules). MDFCL is tested on 13 molecular property prediction benchmark datasets, and the empirical findings demonstrate its effectiveness. In addition, a visualization study shows that MDFCL can embed molecules into representative features and steer the distribution of molecular representations.
{"title":"MDFCL: Multimodal data fusion-based graph contrastive learning framework for molecular property prediction","authors":"Xu Gong , Maotao Liu , Qun Liu , Yike Guo , Guoyin Wang","doi":"10.1016/j.patcog.2025.111463","DOIUrl":"10.1016/j.patcog.2025.111463","url":null,"abstract":"<div><div>Molecular property prediction is a critical task with substantial applications for drug design and repositioning. The multiplicity of molecular data modalities and paucity of labeled data present significant challenges that affect algorithmic performance in this domain. Nevertheless, conventional approaches typically focus on singular data modalities and ignore either hierarchical structural features or other data pattern information, leading to problems when expressing complex phenomena and relationships. Additionally, the scarcity of labeled data obstructs the accurate mapping of instances to labels in property prediction tasks. To address these issues, we propose the <strong>M</strong>ultimodal <strong>D</strong>ata <strong>F</strong>usion-based graph <strong>C</strong>ontrastive <strong>L</strong>earning framework (MDFCL) for molecular property prediction. Specifically, we incorporate exhaustive information from dual molecular data modalities, namely graph and sequence structures. Subsequently, adaptive data augmentation strategies are designed based on the molecular backbones and side chains for multimodal data. Built upon these augmentation strategies, we develop a graph contrastive learning framework and pre-train it with unlabeled data (<span><math><mo>∼</mo></math></span> 10M molecules). MDFCL is tested using 13 molecular property prediction benchmark datasets, demonstrating its effectiveness through empirical findings. In addition, a visualization study demonstrates that MDFCL can embed molecules into representative features and steer the distribution of molecular representations.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111463"},"PeriodicalIF":7.5,"publicationDate":"2025-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143436846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-02-15 | DOI: 10.1016/j.patcog.2025.111447
Wenfeng Song, Zhongyong Ye, Meng Sun, Xia Hou, Shuai Li, Aimin Hao
In the rapidly progressing domain of computer vision, generating high-fidelity facial images from textual descriptions with precision remains a complex challenge. While existing diffusion models have demonstrated capabilities in text-to-image synthesis, they often struggle with capturing intricate details from complex, multi-attribute textual descriptions, leading to entity or attribute loss and inaccurate combinations. We propose AttriDiffuser, a novel model designed to ensure that each entity and attribute in a textual description is distinctly and accurately represented in the synthesized image. AttriDiffuser utilizes a text-driven attribute diffusion adversarial model, enhancing the correspondence between textual attributes and image features. It seamlessly incorporates an attribute-gating cross-attention mechanism into the adversarially enhanced diffusion model. AttriDiffuser advances traditional diffusion models by integrating a face diversity discriminator, which augments adversarial training and promotes the generation of diverse yet precise facial images in alignment with complex textual descriptions. Our empirical evaluation, conducted on the Multimodal VoxCeleb and CelebA-HQ datasets and benchmarked against other state-of-the-art models, demonstrates AttriDiffuser's superior efficacy. The results indicate its capability to synthesize high-quality facial images with rigorous adherence to complex, multi-faceted textual descriptions, marking a significant advancement in text-to-facial attribute synthesis. Our code and model will be made publicly available at https://github.com/sunmeng7/AttriDiffuser.
{"title":"AttriDiffuser: Adversarially enhanced diffusion model for text-to-facial attribute image synthesis","authors":"Wenfeng Song , Zhongyong Ye , Meng Sun , Xia Hou , Shuai Li , Aimin Hao","doi":"10.1016/j.patcog.2025.111447","DOIUrl":"10.1016/j.patcog.2025.111447","url":null,"abstract":"<div><div>In the progressive domain of computer vision, generating high-fidelity facial images from textual descriptions with precision remains a complex challenge. While existing diffusion models have demonstrated capabilities in text-to-image synthesis, they often struggle with capturing intricate details from complex, multi-attribute textual descriptions, leading to entity or attribute loss and inaccurate combinations. We propose AttriDiffuser, a novel model designed to ensure that each entity and attribute in textual descriptions is distinctly and accurately represented in the synthesized images. AttriDiffuser utilizes a text-driven attribute diffusion adversarial model, enhancing the correspondence between textual attributes and image features. It incorporates an attribute-gating cross-attention mechanism seamlessly into the adversarial learning enhanced diffusion model. AttriDiffuser advances traditional diffusion models by integrating a face diversity discriminator, which augments adversarial training and promotes the generation of diverse yet precise facial images in alignment with complex textual descriptions. Our empirical evaluation, conducted on the renowned Multimodal VoxCeleb and CelebA-HQ datasets, and benchmarked against other state-of-the-art models, demonstrates AttriDiffuser’s superior efficacy. The results indicate its unparalleled capability to synthesize high-quality facial images with rigorous adherence to complex, multi-faceted textual descriptions, marking a significant advancement in text-to-facial attribute synthesis. Our code and model will be made publicly available at <span><span>https://github.com/sunmeng7/AttriDiffuser</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"163 ","pages":"Article 111447"},"PeriodicalIF":7.5,"publicationDate":"2025-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143464532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}