Pub Date: 2025-12-20 | DOI: 10.1016/j.inffus.2025.104091
Luping Li , Xing Su , Han Lin , Haoying Han , Chao Fan , Zhao Zhang , Hongzhe Yue
Architectural design, a complex optimization process requiring iterative revisions by skilled architects, increasingly leverages computational tools. While deep generative models show promise in automating floorplan generation, two key limitations persist: (1) reliance on domain expertise, creating high technical barriers for non-experts, and (2) lack of iterative refinement capabilities, limiting post-generation adjustments. To address these challenges, we propose ChatAssistDesign, an interactive text-driven framework combining (1) Floorplan Designer, a large language model (LLM) agent guiding users through design workflows, and (2) ConDiffPlan, a vector-based conditional diffusion model for layout generation. Extensive experimental results demonstrate that our framework achieves significant improvements over state-of-the-art methods in terms of layout diversity, visual realism, text-to-layout alignment accuracy, and crucially, the ability to support iterative refinement while maintaining high robustness against constraint conflicts. By decoupling design complexity from user skill and enabling dynamic post hoc edits, our approach reduces entry barriers and improves integration with downstream tasks.
{"title":"ChatAssistDesign: A language-interactive framework for iterative vector floorplan generation via conditional diffusion","authors":"Luping Li , Xing Su , Han Lin , Haoying Han , Chao Fan , Zhao Zhang , Hongzhe Yue","doi":"10.1016/j.inffus.2025.104091","DOIUrl":"10.1016/j.inffus.2025.104091","url":null,"abstract":"<div><div>Architectural design, a complex optimization process requiring iterative revisions by skilled architects, increasingly leverages computational tools. While deep generative models show promise in automating floorplan generation, two key limitations persist: (1) reliance on domain expertise, creating high technical barriers for non-experts, and (2) lack of iterative refinement capabilities, limiting post-generation adjustments. To address these challenges, we propose ChatAssistDesign, an interactive text-driven framework combining (1) Floorplan Designer, a large language model (LLM) agent guiding users through design workflows, and (2) ConDiffPlan, a vector-based conditional diffusion model for layout generation. Extensive experimental results demonstrate that our framework achieves significant improvements over state-of-the-art methods in terms of layout diversity, visual realism, text-to-layout alignment accuracy, and crucially, the ability to support iterative refinement while maintaining high robustness against constraint conflicts. By abstracting design complexity from user skill and enabling dynamic post hoc edits, our approach reduces entry barriers and improves integration with downstream tasks.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104091"},"PeriodicalIF":15.5,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145796207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-20 | DOI: 10.1016/j.inffus.2025.104076
Chi Li , Haowen Jiang , Ruitao Zhou , Ye Dou , Zishun Shen , Lianmin Zhang , Xiongwen Qian , Jianfeng Mao
Predicting flight delays is crucial for enhancing operational efficiency, improving passenger satisfaction, and optimizing resource allocation within the aviation industry. Despite numerous methods and technologies available in this field, current approaches largely rely on complex feature engineering and sampling techniques, and they do not thoroughly explore the core influencing factors of flight delays. To address the myriad challenges in predicting flight delays, we propose the Hypergraph Attention and Periodic Fusion Learning (HAPFL) framework. Our model comprises modules for hypergraph construction, O-D driven graph attention, multi-view flight embedding, and a period-aware sequential transformer. This holistic approach enables a thorough analysis of the micro and macro integration of flight node representations and, through periodic feature extraction, predicts the delay status of flights over multiple future days. Tested on several real-world datasets, our model consistently outperforms current state-of-the-art baseline models, achieving competitive results across all four classification metrics, demonstrating superior overall predictive performance and the effective learning capabilities of its well-designed modules. Our model innovatively captures high-order relationships between flights, significantly enhancing future delay predictions, and contributing to a deeper understanding of delay mechanisms and more effective flight schedule management.
{"title":"Hypergraph attention and periodic fusion learning for enhanced flight delay prediction","authors":"Chi Li , Haowen Jiang , Ruitao Zhou , Ye Dou , Zishun Shen , Lianmin Zhang , Xiongwen Qian , Jianfeng Mao","doi":"10.1016/j.inffus.2025.104076","DOIUrl":"10.1016/j.inffus.2025.104076","url":null,"abstract":"<div><div>Predicting flight delays is crucial for enhancing operational efficiency, improving passenger satisfaction, and optimizing resource allocation within the aviation industry. Despite numerous methods and technologies available in this field, current approaches largely rely on complex feature engineering and sampling techniques, and they do not thoroughly explore the core influencing factors of flight delays. To address the myriad challenges in predicting flight delays, we propose the Hypergraph Attention and Periodic Fusion Learning (HAPFL) framework. Our model comprises modules for hypergraph construction, O-D driven graph attention, multi-view flight embedding, and a period-aware sequential transformer. This holistic approach enables a thorough analysis of the micro and macro integration of flight node representations and, through periodic feature extraction, predicts the delay status of flights over multiple future days. Tested on several real-world datasets, our model consistently outperforms current state-of-the-art baseline models, achieving competitive results across all four classification metrics, demonstrating superior overall predictive performance and the effective learning capabilities of its well-designed modules. Our model innovatively captures high-order relationships between flights, significantly enhancing future delay predictions, and contributing to a deeper understanding of delay mechanisms and more effective flight schedule management.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"129 ","pages":"Article 104076"},"PeriodicalIF":15.5,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-20 | DOI: 10.1016/j.inffus.2025.104084
Xingyu Zhao , Jianpeng Qi , Bin Lu , Lei Zhou , Lei Cao , Junyu Dong , Yanwei Yu
Data preparation is crucial for achieving optimal results in deep learning. Unfortunately, missing values are common when preparing large-scale spatiotemporal databases. Most existing imputation methods primarily focus on exploring the spatiotemporal correlations of single-source data; however, high missing rates in single-source data result in sparse distributions. Furthermore, existing methods typically focus on shallow correlations at a single scale, limiting the ability of imputation models to effectively leverage multi-scale spatial features. To tackle these challenges, we propose a multivariate dependency-aware spatiotemporal imputation model, named ST-Imputer. Specifically, we introduce multi-source context data to provide sufficient correlation features for target data (i.e., data that needs imputation), alleviating the issue of insufficient available features caused by high missing rates in single-source data. By applying a multi-variate spatiotemporal dependency extraction module, ST-Imputer captures potential associations between different spatial scales. Subsequently, the noise prediction module utilizes the learned dual-view features to formulate the spatiotemporal transmission module, thereby reducing weight errors caused by excessive noise. Finally, physical constraints are applied to prevent unrealistic predictions. Extensive experiments on three large-scale datasets demonstrate the significant superiority of ST-Imputer, achieving up to a 13.07 % improvement in RMSE. The code of our model is available at https://github.com/Lion1a/ST-Imputer.
{"title":"ST-Imputer: Multivariate dependency-aware diffusion network with physics guidance for spatiotemporal imputation","authors":"Xingyu Zhao , Jianpeng Qi , Bin Lu , Lei Zhou , Lei Cao , Junyu Dong , Yanwei Yu","doi":"10.1016/j.inffus.2025.104084","DOIUrl":"10.1016/j.inffus.2025.104084","url":null,"abstract":"<div><div>Data preparation is crucial for achieving optimal results in deep learning. Unfortunately, missing values are common when preparing large-scale spatiotemporal databases. Most existing imputation methods primarily focus on exploring the spatiotemporal correlations of single-source data; however, high missing rates in single-source data result in sparse distributions. Furthermore, existing methods typically focus on shallow correlations at a single scale, limiting the ability of imputation models to effectively leverage multi-scale spatial features. To tackle these challenges, we propose a multivariate dependency-aware spatiotemporal imputation model, named ST-Imputer. Specifically, we introduce multi-source context data to provide sufficient correlation features for target data (<em>i.e</em>., data that needs imputation), alleviating the issue of insufficient available features caused by high missing rates in single-source data. By applying a multi-variate spatiotemporal dependency extraction module, ST-Imputer captures potential associations between different spatial scales. Subsequently, the noise prediction module utilizes the learned dual-view features to formulate the spatiotemporal transmission module, thereby reducing weight errors caused by excessive noise. Finally, physical constraints are applied to prevent unrealistic predictions. Extensive experiments on three large-scale datasets demonstrate the significant superiority of ST-Imputer, achieving up to a 13.07 % improvement in RMSE. The code of our model is available at <span><span>https://github.com/Lion1a/ST-Imputer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104084"},"PeriodicalIF":15.5,"publicationDate":"2025-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145796204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-19 | DOI: 10.1016/j.inffus.2025.104073
Han Feng , Pengyang Song , Yinuo Ren , Hanfeng Zhou , Jue Wang
Regression ensembles, a competitive machine learning technique, have gained popularity in recent years. Popular ensemble schemes have evolved from equal weights (EWs), which utilize simple averages, to optimal weights (OWs), which optimize weights by minimizing mean squared error (MSE). Extensive research has not only validated the robustness of EWs but also introduced the concept of shrinkage, shrinking OWs towards EWs. This paper tackles the ensemble challenge through diversity theory, where ensemble MSE is decomposed into two components: global error and global diversity. Within the decomposition framework, OWs typically minimize global error at the expense of reduced global diversity, while EWs tend to maximize global diversity but often ignore the accuracy. To address the accuracy-diversity trade-off, we derive an optimal shrinkage factor that manages to minimize the ensemble MSE. Simulation results reveal the mediation effect of shrinkage weights, and empirical experiments on six UCI datasets and Brent monthly futures prices demonstrate the superiority of the proposed method, whose mechanism is further expounded through an in-depth analysis of the shrinkage components. Overall, our approach provides a novel perspective on the efficacy of shrinkage in regression ensembles.
{"title":"Shrinkage matters: evidence from accuracy-diversity trade-off in regression ensembles","authors":"Han Feng , Pengyang Song , Yinuo Ren , Hanfeng Zhou , Jue Wang","doi":"10.1016/j.inffus.2025.104073","DOIUrl":"10.1016/j.inffus.2025.104073","url":null,"abstract":"<div><div>Regression ensembles, a competitive machine learning technique, have gained popularity in recent years. Popular ensemble schemes have evolved from equal weights (EWs), which utilize simple averages, to optimal weights (OWs), which optimize weights by minimizing mean squared error (MSE). Extensive research has not only validated the robustness of EWs but also introduced the concept of shrinkage, shrinking OWs towards EWs. This paper tackles the ensemble challenge through diversity theory, where ensemble MSE is decomposed into two components: global error and global diversity. Within the decomposition framework, OWs typically minimize global error at the expense of reduced global diversity, while EWs tend to maximize global diversity but often ignore the accuracy. To address the accuracy-diversity trade-off, we derive an optimal shrinkage factor that manages to minimize the ensemble MSE. Simulation results reveal the mediation effect of shrinkage weights, and empirical experiments on six UCI datasets and Brent monthly future prices demonstrate the superiority of the proposed method, whose mechanism is further expounded through an in-depth analysis of the shrinkage components. Overall, our approach provides a novel perspective on the efficacy of shrinkage in regression ensembles.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104073"},"PeriodicalIF":15.5,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-19 | DOI: 10.1016/j.inffus.2025.104057
Jianfeng He , Linlin Yu , Changbin Li , Runing Yang , Fanglan Chen , Kangshuo Li , Min Zhang , Shuo Lei , Xuchao Zhang , Mohammad Beigi , Kaize Ding , Bei Xiao , Lifu Huang , Feng Chen , Ming Jin , Chang-Tien Lu
Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of domains. However, inaccuracies in their outputs can lead to severe consequences in high-stakes areas such as finance and healthcare, where errors may result in the loss of money, time, or even lives. As a result, recent research has increasingly focused on uncertainty estimation in LLMs, aiming to quantify the trustworthiness of model-generated content given specific inputs. Despite this growing interest, the sources of uncertainty in LLMs remain insufficiently understood. To address this gap, this survey provides a comprehensive overview of uncertainty estimation for LLMs from the perspective of uncertainty sources, serving as a foundational resource for researchers entering the field. We begin by reviewing essential background on LLMs, followed by a detailed clarification of uncertainty sources relevant to them. We then introduce various uncertainty estimation methods, including both commonly used and LLM-specific approaches. Metrics for evaluating uncertainty are discussed, along with key application areas. Finally, we highlight major challenges and outline future research directions aimed at improving the trustworthiness and reliability of LLMs.
{"title":"Survey of uncertainty estimation in LLMs - Sources, methods, applications, and challenges","authors":"Jianfeng He , Linlin Yu , Changbin Li , Runing Yang , Fanglan Chen , Kangshuo Li , Min Zhang , Shuo Lei , Xuchao Zhang , Mohammad Beigi , Kaize Ding , Bei Xiao , Lifu Huang , Feng Chen , Ming Jin , Chang-Tien Lu","doi":"10.1016/j.inffus.2025.104057","DOIUrl":"10.1016/j.inffus.2025.104057","url":null,"abstract":"<div><div>Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of domains. However, inaccuracies in their outputs can lead to severe consequences in high-stakes areas such as finance and healthcare, where errors may result in the loss of money, time, or even lives. As a result, recent research has increasingly focused on uncertainty estimation in LLMs, aiming to quantify the trustworthiness of model-generated content given specific inputs. Despite this growing interest, the sources of uncertainty in LLMs remain insufficiently understood. As a result, this survey provides a comprehensive overview of uncertainty estimation for LLMs from the perspective of uncertainty sources, serving as a foundational resource for researchers entering the field. We begin by reviewing essential background on LLMs, followed by a detailed clarification of uncertainty sources relevant to them. We then introduce various uncertainty estimation methods, including both commonly used and LLM-specific approaches. Metrics for evaluating uncertainty are discussed, along with key application areas. Finally, we highlight major challenges and outline future research directions aimed at improving the trustworthiness and reliability of LLMs.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104057"},"PeriodicalIF":15.5,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-19 | DOI: 10.1016/j.inffus.2025.104072
Abeer A. Wafa , Marwa S. Farhan , Mai M. Eldefrawi
Amid the growing demand for emotionally intelligent systems, Multimodal Emotion Recognition (MER) has emerged as a critical frontier in affective computing. However, achieving reliable generalization across heterogeneous data sources and ensuring semantic alignment across diverse modalities remain unresolved challenges. This research therefore presents a novel and unified framework for MER that unfolds in five coordinated stages: modality-specific cross-dataset pretraining, diffusion-based generative data augmentation, reinforcement learning-driven hyperparameter optimization, latent space alignment, and task-aware multimodal fusion with fine-tuning. Each modality (text, audio, video, and motion) is initially pretrained using large-scale, emotion-labeled corpora to extract domain-invariant affective features. A generative augmentation stage that uses diffusion models increases sample diversity and improves class balance. Hyperparameter scheduling is governed by a Proximal Policy Optimization (PPO) agent that dynamically adjusts learning parameters during both pretraining and fine-tuning phases. Latent space alignment is achieved through a combination of domain-adversarial objectives, statistical regularization (e.g., MMD, CCA), and prototypical contrastive learning. The fusion strategy integrates Cross-Attentional Modality Interaction (CAMI), Bidirectional Alignment Networks (BAN), Gaussian Mixture Interaction Modules (GMIM), and Neural Variational Mixture-of-Experts (NV-MoE) to support context-aware and uncertainty-resilient emotion inference.
Empirical evaluations on MELD, IEMOCAP, and SAVEE demonstrate exceptional performance. Test accuracies reached 99.91 %, 99.87 %, and 99.52 %, respectively, with minimal losses (≤ 0.000056) and inference latencies between 0.02 and 0.07 ms. Post-alignment diagnostics across 100 runs revealed highly stable latent embeddings (Silhouette: 0.960-0.980, CKA: 0.970-0.990), confirming strong cross-modal coherence. Zero-shot testing on external unseen datasets (GoEmotions, CREMA-D, EmotiW, HUMAINE) yielded accuracies above 99.90 %, demonstrating robust generalization without fine-tuning. Even though the model is trained on batch data, deployment through ONNX ensures adaptability for real-time emotion recognition in resource-constrained environments. These findings establish the proposed system as a highly performant and deployable solution for multimodal affect analysis.
{"title":"A unified framework for multimodal emotion recognition across homogeneous and heterogeneous modalities with adaptive fusion","authors":"Abeer A. Wafa , Marwa S. Farhan , Mai M. Eldefrawi","doi":"10.1016/j.inffus.2025.104072","DOIUrl":"10.1016/j.inffus.2025.104072","url":null,"abstract":"<div><div>Amid the growing demand for emotionally intelligent systems, Multimodal Emotion Recognition (MER) has emerged as a critical frontier in affective computing. However, achieving reliable generalization across heterogeneous data sources and ensuring semantic alignment across diverse modalities remain unresolved challenges. So, this research presents a novel and unified framework for MER that unfolds in five coordinated stages: modality-specific cross-dataset pretraining, diffusion-based generative data augmentation, reinforcement learning-driven hyperparameter optimization, latent space alignment, and task-aware multimodal fusion with fine-tuning. Each modality-text, audio, video, and motion-is initially pretrained using large-scale, emotion-labeled corpora to extract domain-invariant affective features. A generative augmentation stage that uses diffusion models increases sample diversity and improves class balance. Hyperparameter scheduling is governed by a Proximal Policy Optimization (PPO) agent that dynamically adjusts learning parameters during both pretraining and fine-tuning phases. Latent space alignment is achieved through a combination of domain-adversarial objectives, statistical regularization (e.g., MMD, CCA), and prototypical contrastive learning. The fusion strategy integrates Cross-Attentional Modality Interaction (CAMI), Bidirectional Alignment Networks (BAN), Gaussian Mixture Interaction Modules (GMIM), and Neural Variational Mixture-of-Experts (NV-MoE) to support context-aware and uncertainty-resilient emotion inference.</div><div>Empirical evaluations on MELD, IEMOCAP, and SAVEE demonstrate exceptional performance. Test accuracies reached 99.91 %, 99.87 %, and 99.52 % respectively, with minimal losses ( ≤ 0.000056) and inference latencies between 0.02-0.07 ms. Post-alignment diagnostics across 100 runs revealed highly stable latent embeddings (Silhouette: 0.960-0.980, CKA: 0.970-0.990), confirming strong cross-modal coherence. Zero-shot testing on external unseen datasets (GoEmotions, CREMA-D, EmotiW, HUMAINE) yielded accuracies above 99.90 %, demonstrating robust generalization without fine-tuning. Even though the model is trained on batch data, the deployment through ONNX ensures adaptability for real-time emotion recognition in resource-constrained environments. These findings establish the proposed system as a highly performant and deployable solution for multimodal affect analysis.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"129 ","pages":"Article 104072"},"PeriodicalIF":15.5,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-18 | DOI: 10.1016/j.inffus.2025.104069
Mingyue Li , Yinghao Zhang , Ruizhong Du , Chunfu Jia , Xiaoyun Guang
The proliferation of digital portraits and the widespread adoption of advanced Face Recognition (FR) systems pose significant privacy threats, rendering the protection of facial identities paramount. However, existing methods face a universal challenge in balancing protection efficacy with visual fidelity: diffusion-based approaches often suffer from diminished protection due to their inherent purification effects, while standalone Spatial Transformation Perturbations (STPs) risk distorting critical facial features and often yield insufficient protection efficacy. To address these limitations, this paper introduces STP-Diff, a synergistic fusion method that integrates spatial and additive perturbations via a region-differentiated perturbation strategy. Specifically, our method applies non-additive spatial perturbations to non-salient regions as a pre-perturbation to resist the diffusion purification effect, thereby providing a more advantageous starting point for the subsequent diffusion model optimization. Building on this foundation, the method concentrates the potent generative capabilities of diffusion models onto identity-critical regions to generate effective additive perturbations for targeted protection. By strategically deploying spatial transformations, a largely under-explored technique in the facial privacy protection domain, our synergistic fusion strategy significantly enhances protection efficacy while achieving excellent visual quality. Extensive experiments on public datasets demonstrate that our method exhibits superior facial privacy protection in black-box targeted scenarios, achieving an average Protection Success Rate (PSR) of 81.09 % and a favorable Fréchet Inception Distance (FID) of 8.79, and demonstrates robust transferability against commercial Face Recognition platforms.
{"title":"STP-Diff: Synergistic fusion of spatial transformation perturbations and diffusion models for robust face privacy protection","authors":"Mingyue Li , Yinghao Zhang , Ruizhong Du , Chunfu Jia , Xiaoyun Guang","doi":"10.1016/j.inffus.2025.104069","DOIUrl":"10.1016/j.inffus.2025.104069","url":null,"abstract":"<div><div>The proliferation of digital portraits and the widespread adoption of advanced Face Recognition (FR) systems pose significant privacy threats, rendering the protection of facial identities paramount. However, existing methods face a universal challenge in balancing protection efficacy with visual fidelity: diffusion-based approaches often suffer from diminished protection due to their inherent purification effects, while standalone Spatial Transformation Perturbations (STPs) risk distorting critical facial features and often yield insufficient protection efficacy. To address these limitations, this paper introduces STP-Diff, a synergistic fusion method that integrates spatial and additive perturbations via a region-differentiated perturbation strategy. Specifically, our method applies non-additive spatial perturbations to non-salient regions as a pre-perturbation to resist the diffusion purification effect, thereby providing a more advantageous starting point for the subsequent diffusion model optimization. Building on this foundation, the method concentrates the potent generative capabilities of diffusion models onto identity-critical regions to generate effective additive perturbations for targeted protection. By strategically deploying spatial transformations, a largely under-explored technique in the facial privacy protection domain, our synergistic fusion strategy significantly enhances protection efficacy while achieving excellent visual quality. Extensive experiments on public datasets demonstrate that our method exhibits superior facial privacy protection in black-box targeted scenarios, achieving an average Protection Success Rate (PSR) of 81.09 % and a favorable Fréchet Inception Distance (FID) of 8.79, and demonstrates robust transferability against commercial Face Recognition platforms.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"129 ","pages":"Article 104069"},"PeriodicalIF":15.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-18 | DOI: 10.1016/j.inffus.2025.104071
Giuseppe De Simone , Luca Greco , Alessia Saggese , Mario Vento
Gender and emotion recognition are traditionally analyzed independently using audio and video modalities, which introduces challenges when fusing their outputs and often results in increased computational overhead and latency. To address these limitations, in this work we introduce MAGNET (Multimodal Architecture for GeNder and Emotion Tasks), a novel multimodal multitask learning framework that jointly performs gender and emotion recognition by simultaneously analyzing audio and visual inputs. MAGNET employs soft parameter sharing, guided by GradNorm to balance task-specific learning dynamics. This design not only enhances recognition accuracy through effective modality fusion but also reduces model complexity by leveraging multitask learning. As a result, our approach is particularly well-suited for deployment on embedded devices, where computational efficiency and responsiveness are critical. Evaluated on the CREMA-D dataset, MAGNET consistently outperforms unimodal baselines and current state-of-the-art methods, demonstrating its effectiveness for efficient and accurate soft biometric analysis.
{"title":"Integrating visual and audio cues for emotion and gender recognition: A multi modal and multi task approach","authors":"Giuseppe De Simone , Luca Greco , Alessia Saggese , Mario Vento","doi":"10.1016/j.inffus.2025.104071","DOIUrl":"10.1016/j.inffus.2025.104071","url":null,"abstract":"<div><div>Gender and emotion recognition are traditionally analyzed independently using audio and video modalities, which introduces challenges when fusing their outputs and often results in increased computational overhead and latency. To address these limitations, in this work we introduces MAGNET (Multimodal Architecture for GeNder and Emotion Tasks), a novel multimodal multitask learning framework that jointly performs gender and emotion recognition by simultaneously analyzing audio and visual inputs. MAGNET employs soft parameter sharing, guided by GradNorm to balance task-specific learning dynamics. This design not only enhances recognition accuracy through effective modality fusion but also reduces model complexity by leveraging multitask learning. As a result, our approach is particularly well-suited for deployment on embedded devices, where computational efficiency and responsiveness are critical. Evaluated on the CREMA-D dataset, MAGNET consistently outperforms unimodal baselines and current state-of-the-art methods, demonstrating its effectiveness for efficient and accurate soft biometric analysis.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104071"},"PeriodicalIF":15.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-18 | DOI: 10.1016/j.inffus.2025.104075
Xingyu Shen , Jinshi Xiao , Xiang Zhang , Long Lan , Xinwang Liu
Video Moment Retrieval (VMR) aims to identify the temporal span in an untrimmed video that semantically corresponds to a natural language query. Existing methods often overlook temporal invariance, making them sensitive to variations in query span and limiting their performance, especially for retrieving short-span moments. To address this limitation, we propose a Span-aware Temporal Aggregation (STA) network that introduces span-aware features to capture temporal invariant patterns, thereby enhancing robustness to varying query spans. STA consists of two key components: (i) A span-aware feature aggregation (SFA) module constructs span-specific visual representations that are aligned with the query to generate span-aware features, which are then integrated into local candidate moments; (ii) a Query-guided Moment Reasoning (QMR) module, which dynamically adapts the receptive fields of temporal convolutions based on query span semantics to achieve fine-grained reasoning. Extensive experiments on three challenging benchmark datasets demonstrate that STA consistently outperforms state-of-the-art methods, with particularly notable gains for short-span moments.
{"title":"Span-aware temporal aggregation network for video moment retrieval","authors":"Xingyu Shen , Jinshi Xiao , Xiang Zhang , Long Lan , Xinwang Liu","doi":"10.1016/j.inffus.2025.104075","DOIUrl":"10.1016/j.inffus.2025.104075","url":null,"abstract":"<div><div>Video Moment Retrieval (VMR) aims to identify the temporal span in an untrimmed video that semantically corresponds to a natural language query. Existing methods often overlook temporal invariance, making them sensitive to variations in query span and limiting their performance, especially for retrieving short-span moments. To address this limitation, we propose a Span-aware Temporal Aggregation (STA) network that introduces span-aware features to capture temporal invariant patterns, thereby enhancing robustness to varying query spans. STA consists of two key components: (i) A span-aware feature aggregation (SFA) module constructs span-specific visual representations that are aligned with the query to generate span-aware features, which are then integrated into local candidate moments; (ii) a Query-guided Moment Reasoning (QMR) module, which dynamically adapts the receptive fields of temporal convolutions based on query span semantics to achieve fine-grained reasoning. Extensive experiments on three challenging benchmark datasets demonstrate that STA consistently outperforms state-of-the-art methods, with particularly notable gains for short-span moments.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"130 ","pages":"Article 104075"},"PeriodicalIF":15.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-12-18 | DOI: 10.1016/j.inffus.2025.104068
Rui Pan , Haoran Luo , Quan Yuan , Guiyang Luo , Jinglin Li , Tiesunlong Shen , Rui Mao , Erik Cambria
Multitask reinforcement learning aims to train a unified policy that generalizes across multiple related tasks, improving sample efficiency and promoting knowledge transfer. However, existing methods often suffer from negative knowledge transfer due to task interference, especially when using hard parameter sharing across tasks with diverse dynamics or goals. Conventional solutions typically adopt shared backbones with task-specific heads, gradient projection methods, or routing-based networks to mitigate conflict. However, many of these methods rely on simplistic task identifiers (e.g., one-hot vectors), lack expressive representations of task semantics, or fail to modulate shared components in a fine-grained, task-specific manner. To overcome these challenges, we propose Metadata-guided Adaptive Routing (MetaAR), a novel framework that incorporates rich task metadata such as natural language descriptions to generate expressive and interpretable task representations. These representations are injected into a dynamic routing network, which adaptively reconfigures layer-wise computation paths in a shared modular policy network. To enable robust task-specific adaptation, we further introduce a noise-injected Top-K routing mechanism that dynamically selects the most relevant computation paths for each task. By injecting stochasticity during routing, this mechanism promotes exploration and mitigates interference between tasks through sparse, selective information flow. We evaluate MetaAR on the Meta-World benchmark with up to 50 robotic manipulation tasks, where it consistently outperforms strong baselines, achieving 4–8 % higher mean success rates than the best-performing methods across the MT10 and MT50 variants.
{"title":"Multitask reinforcement learning with metadata-guided adaptive routing","authors":"Rui Pan , Haoran Luo , Quan Yuan , Guiyang Luo , Jinglin Li , Tiesunlong Shen , Rui Mao , Erik Cambria","doi":"10.1016/j.inffus.2025.104068","DOIUrl":"10.1016/j.inffus.2025.104068","url":null,"abstract":"<div><div>Multitask reinforcement learning aims to train a unified policy that generalizes across multiple related tasks, improving sample efficiency and promoting knowledge transfer. However, existing methods often suffer from negative knowledge transfer due to task interference, especially when using hard parameter sharing across tasks with diverse dynamics or goals. Conventional solutions typically adopt shared backbones with task-specific heads, gradient projection methods, or routing-based networks to mitigate conflict. However, many of these methods rely on simplistic task identifiers (e.g., one-hot vectors), lack expressive representations of task semantics, or fail to modulate shared components in a fine-grained, task-specific manner. To overcome these challenges, we propose <strong>Meta</strong>data-guided <strong>A</strong>daptive <strong>R</strong>outing (<strong>MetaAR</strong>), a novel framework that incorporates rich task metadata such as natural language descriptions to generate expressive and interpretable task representations. These representations are injected into a dynamic routing network, which adaptively reconfigures layer-wise computation paths in a shared modular policy network. To enable robust task-specific adaptation, we further introduce a noise-injected Top-K routing mechanism that dynamically selects the most relevant computation paths for each task. By injecting stochasticity during routing, this mechanism promotes exploration and mitigates interference between tasks through sparse, selective information flow. We evaluate MetaAR on the Meta-World benchmark with up to 50 robotic manipulation tasks, where it consistently outperforms strong baselines, achieving 4–8 % higher mean success rates than the best-performing methods across the MT10 and MT50 variants.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"129 ","pages":"Article 104068"},"PeriodicalIF":15.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}