Pub Date: 2025-12-04 | DOI: 10.1109/LSP.2025.3640519
TPEech: Target Speaker Extraction and Noise Suppression With Historical Dialogue Text Cues
Ziyang Jiang;Xueyan Chen;Shuai Wang;Xinyuan Qian;Haizhou Li
In complex multi-speaker scenarios with significant speaker overlap and background noise, extracting the target speaker's speech remains a major challenge. This capability is crucial for dialogue-based applications such as AI speech assistants, where downstream tasks such as speech recognition depend on clean speech. A promising solution to these challenges is Target Speaker Extraction (TSE), which leverages auxiliary information to extract target speech from mixed and noisy speech, thus overcoming the limitations of Speech Separation (SS) and Speech Enhancement (SE). To this end, we propose a multi-modal TSE network, the Text Prompt Extractor with echo cue block (TPEech), which uses historical dialogue text as an extraction cue and incorporates an echo cue block (ECB) to further exploit this cue and improve TSE performance. Experiments demonstrate the strong extraction and denoising capability of the proposed network. TPEech achieves an SI-SDRi of 9.632 dB, an SDR of 13.045 dB, a PESQ of 2.814, and a STOI of 0.885, outperforming competitive baselines. Additionally, we experimentally verify that TPEech is robust against semantically incomplete textual prompts.
IEEE Signal Processing Letters, vol. 33, pp. 351-355.
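As a quick reference for the metrics quoted above, the sketch below shows how SI-SDR and its improvement over the unprocessed mixture (SI-SDRi) are conventionally computed; the function and the toy signals are illustrative, not the paper's evaluation code.

# Illustrative only: conventional SI-SDR and SI-SDRi computation.
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_sdr_improvement(extracted, mixture, reference):
    # SI-SDRi: improvement of the extracted signal over the unprocessed mixture.
    return si_sdr(extracted, reference) - si_sdr(mixture, reference)

# Toy usage with random signals standing in for real waveforms.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
mix = ref + 0.5 * rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)
print(si_sdr_improvement(est, mix, ref))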
Pub Date: 2025-12-02 | DOI: 10.1109/LSP.2025.3639361
Enhanced Multi-Scale PoseNet for Self-Supervised Monocular Depth Estimation
Chao Zhang;Tian Tian;Cheng Han;Tiancheng Shao;Mi Zhou;Shichao Zhao
Monocular depth estimation is essential for 3D perception in applications such as autonomous driving and robotics. Self-supervised methods avoid depth labels but often rely on shallow pose networks with weak temporal modeling, leading to unstable predictions. We propose EMSP-Net, an Enhanced Multi-Scale PoseNet for self-supervised monocular depth estimation. It introduces a hierarchical feature fusion encoder, a temporal attention-context decoder, and a pose consistency loss to jointly improve feature extraction, temporal stability, and geometric constraints. On the KITTI dataset, EMSP-Net achieved an absolute relative error of 0.105 and a squared relative error of 0.708. Cross-domain evaluation on Make3D further demonstrated its robustness.
IEEE Signal Processing Letters, vol. 33, pp. 316-320.
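The two error figures quoted above follow the standard monocular-depth evaluation protocol; a minimal sketch of their definitions, with synthetic depth maps standing in for KITTI data, is given below.

# Illustrative only: standard absolute relative and squared relative depth errors.
import numpy as np

def depth_errors(pred, gt):
    """pred, gt: arrays of per-pixel depths (metres), same shape; pixels with gt <= 0 are ignored."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean(((pred - gt) ** 2) / gt)
    return abs_rel, sq_rel

# Toy usage with synthetic depth maps.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 80.0, size=(375, 1242))
pred = gt * (1.0 + 0.1 * rng.standard_normal(gt.shape))
print(depth_errors(pred, gt))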
Pub Date: 2025-12-02 | DOI: 10.1109/LSP.2025.3639352
Text-Driven Medical Image Segmentation With LLM Semantic Bridge and LLM Prompt Bridge
Zhengyi Liu;Jiali Wu;Xianyong Fang;Linbo Wang
Text-driven medical image segmentation aims to accurately segment pathological regions in medical images based on textual descriptions. Existing methods face two major challenges: (a) the significant modality heterogeneity between textual and visual features leads to inefficient cross-modal feature alignment; (b) the insufficient utilization of medical shared knowledge restricts semantic understanding. To address these challenges, two large language model (LLM) bridges are constructed. The LLM semantic bridge leverages the sequential modeling capability of a frozen LLM to reorganize visual features into semantically coherent units that possess linguistic logic, thereby effectively bridging vision and language. The LLM prompt bridge appends learnable prompts, which encode medical shared knowledge from the LLM, to text embeddings, thereby effectively bridging case-specific descriptions and medical consensus knowledge. Experimental results demonstrate the superior performance brought by LLM participation.
IEEE Signal Processing Letters, vol. 33, pp. 146-150.
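As a rough illustration of the prompt-bridge idea (learnable prompts appended to text embeddings), the sketch below shows one minimal way such a module could look; the class name, prompt count and embedding size are assumptions, not the authors' implementation.

# A minimal sketch (not the paper's code) of appending learnable prompt tokens to text embeddings.
import torch
import torch.nn as nn

class PromptBridge(nn.Module):
    def __init__(self, num_prompts=8, dim=768):
        super().__init__()
        # Learnable prompt tokens, intended to absorb shared (dataset-level) knowledge.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, text_emb):
        # text_emb: (batch, num_tokens, dim) embeddings of the case-specific description.
        b = text_emb.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        # Append the shared prompts after the case-specific tokens.
        return torch.cat([text_emb, prompts], dim=1)

bridge = PromptBridge()
out = bridge(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 24, 768])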
Pub Date: 2025-12-02 | DOI: 10.1109/LSP.2025.3639512
Adaptive Experiment Design for Nonlinear System Identification With Operational Constraints
Jingwei Hu;Dave Zachariah;Torbjörn Wigren;Petre Stoica
We consider the joint problem of online experiment design and parameter estimation for identifying nonlinear system models, while adhering to system constraints. We utilize a receding horizon approach and propose a new adaptive input design criterion, which is tailored to continuously updated parameter estimates, along with a new sequential estimator. We demonstrate the ability of the method to design informative experiments online, while steering the system within operational constraints.
IEEE Signal Processing Letters, vol. 33, pp. 151-155.
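To make the idea of adaptive, constraint-aware input design concrete, the toy sketch below runs a one-step receding-horizon loop: at each step the next input is chosen, within an amplitude bound, to maximise the determinant of the updated information matrix, and the parameter estimate is refreshed recursively. The model, criterion and constraint are illustrative stand-ins, not the paper's.

# Toy adaptive input design: D-optimal one-step selection under an amplitude constraint,
# with a recursive least-squares parameter update (model is linear in its parameters).
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([1.5, -0.7])            # unknown parameters
phi = lambda u: np.array([u, u ** 2])         # regressor (nonlinear in the input u)
u_max = 2.0                                   # operational constraint |u| <= u_max
candidates = np.linspace(-u_max, u_max, 41)

P = np.eye(2) * 100.0                         # covariance of the current estimate
theta_hat = np.zeros(2)
for t in range(50):
    # Pick the input maximising det of the updated information matrix.
    info = np.linalg.inv(P)
    u = max(candidates, key=lambda c: np.linalg.det(info + np.outer(phi(c), phi(c))))
    y = theta_true @ phi(u) + 0.1 * rng.standard_normal()
    # Recursive least-squares update with the new measurement.
    x = phi(u)
    k = P @ x / (1.0 + x @ P @ x)
    theta_hat = theta_hat + k * (y - theta_hat @ x)
    P = P - np.outer(k, x @ P)
print(theta_hat)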
Pub Date: 2025-12-02 | DOI: 10.1109/LSP.2025.3639347
MFPD: Mamba-Driven Feature Pyramid Decoding for Underwater Object Detection
Yiteng Guo;Junpeng Xu;Jiali Wang;Wenyi Zhao;Weidong Zhang
Underwater object detection suffers from limited long-range dependency modeling, fine-grained feature representation, and noise suppression, resulting in blurred boundaries, frequent missed detections, and reduced robustness. To address these challenges, we propose the Mamba-Driven Feature Pyramid Decoding (MFPD) framework, which employs a parallel Feature Pyramid Network and Path Aggregation Network collaborative pathway to enhance semantic and geometric features. A lightweight Mamba Block models long-range dependencies, while an Adaptive Sparse Self-Attention module highlights discriminative targets and suppresses noise. Together, these components improve feature representation and robustness. Experiments on two publicly available underwater datasets demonstrate that MFPD significantly outperforms existing methods, validating its effectiveness in complex underwater environments. The code is publicly available at: https://github.com/YitengGuo/MFPD
IEEE Signal Processing Letters, vol. 33, pp. 141-145.
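One common realisation of sparse self-attention keeps only the top-k scoring keys per query before the softmax, which suppresses low-relevance (often noisy) positions; the sketch below shows that variant. Whether MFPD's Adaptive Sparse Self-Attention follows exactly this recipe is an assumption made for illustration.

# A sketch of top-k sparse attention: scores below each query's k-th largest are masked out.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, keep=8):
    # q, k, v: (batch, tokens, dim)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    kth = scores.topk(keep, dim=-1).values[..., -1:]          # k-th largest score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))  # drop everything below it
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 64, 128)
print(topk_sparse_attention(q, k, v).shape)   # torch.Size([2, 64, 128])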
Pub Date: 2025-12-02 | DOI: 10.1109/LSP.2025.3639375
Addressing Missing Data in Thermal Power Plant Monitoring With Hybrid Attention Time Series Imputation
Liusong Huang;Adam Amril bin Jaharadak;Nor Izzati Ahmad;Jie Wang;Dalin Zhang
Thermal power plants rely on extensive sensor networks to monitor key operational parameters, yet harsh industrial environments often lead to incomplete data characterized by significant noise, complex physical dependencies, and abrupt state transitions. This impedes accurate monitoring and predictive analyses. To address these domain-specific challenges, we propose a novel Hybrid Multi-Head Attention (HybridMHA) model for time series imputation. The core novelty of our approach lies in the synergistic combination of diagonally-masked self-attention and dynamic sparse attention. Specifically, the diagonally-masked component strictly preserves temporal causality to model the sequential evolution of plant states, while the dynamic sparse component selectively identifies critical cross-variable dependencies, effectively filtering out sensor noise. This tailored design enables the model to robustly capture sparse physical inter-dependencies even during abrupt operational shifts. Using a real-world dataset from a thermal power plant, our model demonstrates statistically significant improvements, outperforming existing methods by 10%–20% on key metrics. Further validation on a public benchmark dataset confirms its generalizability. These findings highlight the model's potential for robust real-time monitoring in complex industrial applications.
IEEE Signal Processing Letters, vol. 33, pp. 536-540.
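In time-series imputation models, a "diagonal mask" commonly blocks each time step from attending to itself, so a missing reading must be reconstructed from the other steps, while a causal mask restricts attention to past and present steps. The sketch below shows a generic masked scaled-dot-product attention with both mask variants; which masking HybridMHA actually uses is not specified here, so this is illustrative only.

# Generic masked attention with a diagonal (anti-copy) mask and a causal mask.
import numpy as np

def masked_attention(q, k, v, mask):
    """q, k, v: (T, d); mask: (T, T) boolean, True = position may be attended to."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)            # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, d = 6, 4
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((T, d))
diag_mask = ~np.eye(T, dtype=bool)                   # every step except itself
causal_mask = np.tril(np.ones((T, T), dtype=bool))   # past and present only
print(masked_attention(q, k, v, diag_mask).shape)    # (6, 4)
print(masked_attention(q, k, v, causal_mask).shape)  # (6, 4)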
Pub Date: 2025-11-28 | DOI: 10.1109/LSP.2025.3638634
Continual Deepfake Detection Based on Multi-Perspective Sample Selection Mechanism
Yu Lian;Xinshan Zhu;Di He;Biao Sun;Ruyi Zhang
The rapid development and malicious use of deepfakes have created a significant crisis of trust. To cope with evolving deepfake technologies, an increasing number of detection methods adopt the continual learning paradigm, but they often suffer from catastrophic forgetting. Although replay-based methods mitigate this issue by storing a portion of samples from historical tasks, their sample selection strategies usually rely on a single metric, which may lead to the omission of critical samples and consequently hinder the construction of a robust instance memory bank. In this letter, we propose a novel Multi-perspective Sample Selection Mechanism (MSSM) for continual deepfake detection, which jointly evaluates prediction error, temporal instability, and sample diversity to preserve informative and challenging samples in the instance memory bank. Furthermore, we design a Hierarchical Prototype Generation Mechanism (HPGM) that constructs prototypes at both the category and task levels, which are stored in the prototype memory bank. Extensive experiments under two evaluation protocols demonstrate that the proposed method achieves state-of-the-art performance.
IEEE Signal Processing Letters, vol. 33, pp. 131-135.
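The letter combines prediction error, temporal instability, and diversity when deciding which samples to keep for replay. The sketch below shows one generic way to score and greedily pick samples under those three criteria; the weighting and the exact scoring functions are illustrative assumptions, not the MSSM formulas.

# Generic multi-criterion replay selection: error + instability + feature-space diversity.
import numpy as np

def select_replay_samples(errors, pred_history, features, k, w=(1.0, 1.0, 1.0)):
    """errors: (N,) loss per sample; pred_history: (E, N) predictions over E recent epochs;
    features: (N, D) embeddings; returns indices of k samples to keep in the buffer."""
    instability = pred_history.std(axis=0)                     # temporal instability
    selected = []
    for _ in range(k):
        if selected:
            d = np.linalg.norm(features[:, None, :] - features[selected][None, :, :], axis=-1)
            diversity = d.min(axis=1)                          # distance to nearest kept sample
        else:
            diversity = np.ones(len(errors))
        score = w[0] * errors + w[1] * instability + w[2] * diversity
        score[selected] = -np.inf                              # do not pick the same sample twice
        selected.append(int(np.argmax(score)))
    return selected

rng = np.random.default_rng(0)
idx = select_replay_samples(rng.random(100), rng.random((5, 100)), rng.standard_normal((100, 16)), k=10)
print(idx)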
Pub Date: 2025-11-28 | DOI: 10.1109/LSP.2025.3638676
A Deployment-Oriented Simulation Framework for Deep Learning-Based Lane Change Prediction
Luca Forneris;Riccardo Berta;Matteo Fresta;Luca Lazzaroni;Hadise Rojhan;Changjae Oh;Alessandro Pighetti;Hadi Ballout;Fabio Tango;Francesco Bellotti
Advanced driving simulations are increasingly used in automated driving research, yet freely available data and tools remain limited. We present a new open-source framework for synthetic data generation for lane change (LC) intention recognition on highways. Built on the CARLA simulator, it advances the state of the art by providing a 50-driver dataset, a large-scale 3D map, and code for reproducibility and new data creation. The 60 km highway map includes varying curvature radii and straight segments. The codebase supports simulation enhancements (traffic management, vehicle cockpit, engine noise) and Machine Learning (ML) model training and evaluation, including CARLA log post-processing into time series. The dataset contains over 3,400 annotated LC maneuvers with synchronized ego dynamics, road geometry, and traffic context. From an automotive industry perspective, we also assess leading-edge ML models on STM32 microcontrollers using deployability metrics. Unlike prior infrastructure-based works, we estimate time-to-LC from ego-centric data. Results show that a Transformer model yields the lowest regression error, while XGBoost offers the best trade-offs on extremely resource-constrained devices. The entire framework is publicly released to support advancement in automated driving research.
IEEE Signal Processing Letters, vol. 33, pp. 136-140.
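As a toy illustration of the downstream regression task (predicting time-to-LC from ego-centric time-series windows) with one of the models mentioned above, the sketch below fits an XGBoost regressor on synthetic windowed features; the data layout and hyper-parameters are placeholders, not the framework's pipeline.

# Toy time-to-lane-change regression on synthetic windowed ego-centric features.
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
n, window, feat = 2000, 20, 5                  # samples, time steps per window, signals per step
X = rng.standard_normal((n, window * feat))    # e.g. speed, lateral offset, yaw rate, gaps, ...
t_to_lc = rng.uniform(0.0, 5.0, n)             # seconds until the lane change starts

model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X[:1600], t_to_lc[:1600])
pred = model.predict(X[1600:])
print(np.mean(np.abs(pred - t_to_lc[1600:])))  # mean absolute error on held-out windows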
Pub Date: 2025-11-28 | DOI: 10.1109/LSP.2025.3638688
Cross-Modal Attention Guided Enhanced Fusion Network for RGB-T Tracking
Jun Liu;Wei Ke;Shuai Wang;Da Yang;Hao Sheng
Visual tracking that combines RGB and thermal infrared modalities (RGB-T) aims to utilize the useful information of each modality to achieve more robust object localization. Most existing tracking methods based on convolutional neural networks (CNNs) and Transformers emphasize integrating multi-modal features through cross-modal attention, but overlook the potential of the complementary information learned by cross-modal attention for enhancing modality-specific features. In this paper, we propose a novel hierarchical progressive fusion network based on cross-modal attention guided enhancement for RGB-T tracking. Specifically, the complementary information generated by cross-modal attention implicitly reflects the regions of interest that the modalities consistently deem important, and it is used to enhance modal features in a targeted manner. In addition, a modal feature refinement module and a fusion module are designed based on dynamic routing to perform noise suppression and adaptive integration on the enhanced multi-modal features. Extensive experiments on GTOT, RGBT234, LasHeR and VTUAV show that our method achieves competitive performance compared with recent state-of-the-art methods.
IEEE Signal Processing Letters, vol. 33, pp. 276-280.
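A minimal sketch of the general idea of using cross-modal attention output as a cue to enhance each modality's own features is given below: the complementary information attended from the other modality is gated and added back residually. The module is an illustrative stand-in, not the paper's architecture.

# Sketch: cross-modal attention output gated and added back to enhance one modality's features.
import torch
import torch.nn as nn

class CrossModalEnhance(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x, other):
        # x, other: (batch, tokens, dim) features of the two modalities.
        complementary, _ = self.attn(query=x, key=other, value=other)
        return x + self.gate(complementary) * complementary   # targeted enhancement

rgb, tir = torch.randn(2, 64, 256), torch.randn(2, 64, 256)
enhance = CrossModalEnhance()
rgb_enhanced = enhance(rgb, tir)
tir_enhanced = enhance(tir, rgb)
print(rgb_enhanced.shape, tir_enhanced.shape)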
Pub Date: 2025-11-28 | DOI: 10.1109/LSP.2025.3638666
Generative Model for 2.5D-Assisted Future Urban Remote Sensing Image Synthesis
Yuhan Zhang;Jie Zhou;Weihang Peng;Xiaode Liu;Yuanpei Chen
Generating realistic future urban remote sensing imagery is critical for visualizing potential urban changes and supporting related technical analysis within urban planning. Traditional 2D-assisted methods are inherently limited in synthesizing vertical development and infrastructure evolution, as they rely on binary planning maps. To address these limitations, we propose a novel 2.5D-assisted future urban remote sensing image synthesis method, aimed at generating future urban layouts based on existing urban structures and 2.5D planning maps. Specifically, the 2.5D map is divided into construction and demolition components, which are then integrated with the existing layout images and the embedding of the corresponding text as conditions for our generative model. We further design two trainable cascaded gated attention layers that process these two conditions separately and embed them into the latent diffusion model (LDM). This approach allows our model to dynamically comprehend the planning design requirements for key areas, making adjustments to accommodate diverse demands. Compared to existing state-of-the-art (SoTA) methods, our approach effectively targets design requirements, enabling flexible modifications that involve new constructions and demolitions in relevant urban areas. Experimental results on the 3DCD dataset demonstrate that the images generated by our method retain high fidelity and exhibit strong consistency with the 2.5D planning map.
IEEE Signal Processing Letters, vol. 33, pp. 878-882.
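The sketch below illustrates one plausible form of cascaded gated cross-attention conditioning: the latent features attend first to the construction-related condition and then to the demolition-related condition, each injection passing through a zero-initialised learnable gate so the conditioning starts as a no-op. The gating form (tanh), shapes and names are assumptions for illustration, not the authors' design.

# Sketch: two cascaded gated cross-attention blocks injecting separate conditions into latents.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=320, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))     # starts closed, opened during training

    def forward(self, latent, cond):
        out, _ = self.attn(query=latent, key=cond, value=cond)
        return latent + torch.tanh(self.gate) * out

latent = torch.randn(2, 256, 320)                    # flattened latent-diffusion features
cond_build = torch.randn(2, 77, 320)                 # construction condition tokens
cond_demol = torch.randn(2, 77, 320)                 # demolition condition tokens
block_a, block_b = GatedCrossAttention(), GatedCrossAttention()
latent = block_b(block_a(latent, cond_build), cond_demol)
print(latent.shape)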