TripleMixer: A Triple-Domain Mixing Model for Point Cloud Denoising under Adverse Weather
Pub Date: 2025-11-20 | DOI: 10.1109/tip.2025.3629047
Xiongwei Zhao, Congcong Wen, Xu Zhu, Yang Wang, Haojie Bai, Wenhao Dou
Adverse weather conditions such as snow, fog, and rain pose significant challenges to LiDAR-based perception models by introducing noise and corrupting point cloud measurements. To address this issue, we propose TripleMixer, a robust and efficient point cloud denoising network that integrates spatial, frequency, and channel-wise processing through three specialized mixer modules. TripleMixer effectively suppresses high-frequency noise while preserving essential geometric structures and can be seamlessly deployed as a plug-and-play module within existing LiDAR perception pipelines. To support the development and evaluation of denoising methods, we construct two large-scale simulated datasets, Weather-KITTI and Weather-NuScenes, covering diverse weather scenarios with dense point-wise semantic and noise annotations. Based on these datasets, we establish four benchmarks: Denoising, Semantic Segmentation (SS), Place Recognition (PR), and Object Detection (OD). These benchmarks enable systematic evaluation of denoising generalization, transferability, and downstream impact under both simulated and real-world adverse weather conditions. Extensive experiments demonstrate that TripleMixer achieves state-of-the-art denoising performance and yields substantial improvements across all downstream tasks without requiring retraining. Our results highlight the potential of denoising as a task-agnostic preprocessing strategy to enhance LiDAR robustness in real-world autonomous driving applications.
DiWTBR: Dilated Wavelet Transformer for Efficient Megapixel Bokeh Rendering
Pub Date: 2025-11-18 | DOI: 10.1109/tip.2025.3632227
Xiaoshi Qiu, Shiyue Yan, Qingmin Liao, Shaojun Liu
Bokeh is widely used in photography and is traditionally achieved with large-aperture cameras. Bokeh rendering from pictures taken with small-aperture cameras has attracted much attention due to its system simplicity. Most existing methods employ Convolutional Neural Networks and often mistakenly blur the foreground because of their limited receptive field. In contrast, Transformers can easily capture long-range dependencies and are therefore better suited to this problem. However, Transformers incur a high computational burden, especially on high-resolution images. In this paper, we propose a Dilated Wavelet Transformer model for Bokeh Rendering (DiWTBR) from a single megapixel small-aperture image. It employs both window attention and dilated attention schemes, introducing local and global spatial interactions at low computational cost. Moreover, to further improve efficiency, we employ the wavelet transform in the attention block. Experimental results demonstrate that DiWTBR outperforms state-of-the-art methods by up to 0.7 dB in PSNR. Last but not least, our model can readily run on mainstream personal computers and laptops, consuming only 4 GB of GPU memory. The code will be available on GitHub upon acceptance.
{"title":"DiWTBR: Dilated Wavelet Transformer for Efficient Megapixel Bokeh Rendering.","authors":"Xiaoshi Qiu,Shiyue Yan,Qingmin Liao,Shaojun Liu","doi":"10.1109/tip.2025.3632227","DOIUrl":"https://doi.org/10.1109/tip.2025.3632227","url":null,"abstract":"Bokeh is widely used in photography and is traditionally achieved with large-aperture cameras. Bokeh rendering from pictures taken with small-aperture cameras has attracted much attention due to its system simplicity. Most of the existing methods employ Convolutional Neural Networks and often mistakenly blur the foreground due to the limited receptive field. In contrast, Transformers can easily capture long-range dependencies. Therefore, it is more suitable for this problem. However, Transformers suffer from a high computation burden, especially for high-resolution images. In this paper, we propose a Dilated Wavelet Transformer model for Bokeh Rendering (DiWTBR) from a single small-aperture image with megapixels. It employs both window attention and dilated attention schemes, introducing both local and global spatial interactions at a low computation cost. Moreover, to further improve the efficiency, we employ the wavelet transform in the attention block. Experimental results demonstrate that DiWTBR outperforms the state-of-the-art methods by up to 0.7dB in PSNR. Last but not least, our model can be readily implemented on mainstream personal computers and laptops, with only 4G GPU memory consumption. The code will be available on GitHub upon acceptance.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"1 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145545046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ROOT: Region-word Alignment with Partial Optimal Transport for Open-vocabulary Object Detection
Pub Date: 2025-11-18 | DOI: 10.1109/tip.2025.3627395
Jinhong Deng, Yinjie Lei, Wen Li, Lixin Duan
Open-vocabulary object detection (OVD) aims to detect novel object concepts by mining region-word correspondences from image-text pairs, yet current methods often produce false correspondences. While some strategies (e.g., one-to-one matching) have been proposed to mitigate this issue, they often discard many valuable region-word pairs during matching. To overcome these challenges, we propose Region-word Alignment with Partial Optimal Transport (ROOT), a comprehensive alignment framework that reframes region-word matching as a problem of partial distribution alignment. Unlike traditional optimal transport, which transports the full mass of the distributions, partial optimal transport enables selective matching, making it more robust to noise in region-word alignment. Specifically, ROOT first employs partial optimal transport to obtain an optimal transport plan between region and word features. This transport plan is then used to compute a matching reliability score for each region-word pair, which reweights the contrastive alignment loss to improve accuracy. By enabling more flexible and reliable region-text matches, ROOT significantly reduces misalignment errors while preserving valuable region-word correspondences. Extensive experiments on the standard OV-COCO and OV-LVIS benchmarks show that ROOT outperforms previous state-of-the-art methods, demonstrating the effectiveness of our approach.
{"title":"ROOT: Region-word Alignment with Partial Optimal Transport for Open-vocabulary Object Detection.","authors":"Jinhong Deng,Yinjie Lei,Wen Li,Lixin Duan","doi":"10.1109/tip.2025.3627395","DOIUrl":"https://doi.org/10.1109/tip.2025.3627395","url":null,"abstract":"Open-vocabulary object detection (OVD) aims to detect novel object concepts by mining region-word correspondences from image-text pairs, yet current methods often produce false correspondences. While some strategies (e.g., one-to-one matching) were proposed to mitigate this issue, they often sacrifice numerous valuable region-word pairs during the matching process. To overcome these challenges, we propose a novel comprehensive alignment method, named Region-word Alignment with Partial Optimal Transport (ROOT) framework, which reframes the region-word matching task as a problem of partial distribution alignment. Unlike traditional optimal transport, which shifts the full mass of the distribution, partial optimal transport enables selective matching, making it more robust to noise in region and word alignment. Specifically, ROOT first employs partial optimal transport to obtain an optimal transport plan for region and word feature alignment. This transport plan is then used to compute a matching reliability score for each region-word pair, which reweights the contrastive alignment loss to enhance accuracy. By enabling more flexible and reliable region-text matches, ROOT significantly reduces misalignment errors while preserving valuable region-word correspondences. Extensive experiments on standard benchmarks OV-COCO and OV-LVIS show that our ROOT outperforms the previous state-of-the-art works, demonstrating the effectiveness of our approach.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"130 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145545045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Generalizable Prompt Learning via Multi-regularization Guided Knowledge Distillation
Pub Date: 2025-11-18 | DOI: 10.1109/tip.2025.3632223
Xi Yang, Xinyue Zhong, Dechen Kong, Nannan Wang
Prompt learning has made significant progress in vision-language models (VLMs), enabling pre-trained models such as CLIP to perform cross-domain tasks with few-shot or even zero-shot learning. However, existing methods tend to overfit the training data after fine-tuning on the target domain, leading to a decline in generalization ability and limiting their performance on unseen categories. To address these challenges, we propose a multi-regularization guided knowledge distillation framework for generalizable prompt learning. This approach enhances the model's adaptability and generalization through regularization at different stages while mitigating the performance degradation caused by target-domain training. Specifically, within the image encoder of CLIP, we introduce Residual Regularization, which binds additional residual connections to certain transformer blocks. This design provides greater flexibility, allowing the model to adjust to new data distributions when adapting to the target domain. Furthermore, during training, we impose Self-distillation Regularization to ensure that, while adapting to the target domain, the model preserves its prior generalization knowledge; specifically, we regularize the intermediate-layer outputs of the transformer blocks to prevent the model from excessively favoring target-domain data. Additionally, we employ an unsupervised knowledge distillation strategy that enforces multi-level alignment between the teacher and student models through Direction Distillation Regularization. This ensures that both models maintain consistent visual feature orientations under the same textual features, thereby enhancing overall model stability and cross-domain adaptability. Experimental results demonstrate that our method achieves more stable classification performance in both cross-domain few-shot classification and domain adaptation settings.
{"title":"Towards Generalizable Prompt Learning via Multi-regularization Guided Knowledge Distillation.","authors":"Xi Yang,Xinyue Zhong,Dechen Kong,Nannan Wang","doi":"10.1109/tip.2025.3632223","DOIUrl":"https://doi.org/10.1109/tip.2025.3632223","url":null,"abstract":"Prompt learning has made significant progress in vision-language models (VLMs), enabling pre-trained models like CLIP to perform cross-domain tasks with few-shot or even zero-shot learning. However, existing methods tend to overfit the training data after fine-tuning on the target domain, leading to a decline in generalization ability and limiting their performance on unseen categories.To address these challenges, we propose a multi-regularization guided knowledge distillation towards generalizable prompt learning. This approach enhances the model's adaptability and generalization through different stages of regularization while mitigating performance degradation caused by target domain training. Specifically, within the image encoder of CLIP, we introduce Residual Regularization, which binds additional residual connections to certain transformer blocks. This design provides greater flexibility, allowing the model to adjust to new data distributions when adapting to the target domain.Furthermore, during training, we impose Self-distillation Regularization to ensure that while adapting to the target domain, the model preserves its prior generalization knowledge. Specifically, we regularize the intermediate layer outputs of Transformer Blocks to prevent the model from excessively favoring target domain data. Additionally, we employ an unsupervised knowledge distillation strategy to enforce multi-level alignment between the teacher and student models by Direction Distillation Regularization. This ensures that both models maintain consistent visual feature orientations under the same textual features, thereby enhancing overall model stability and cross-domain adaptability.Experimental results demonstrate that our method achieves more stable classification performance in both cross-domain few-shot classification and domain adaptation settings.","PeriodicalId":13217,"journal":{"name":"IEEE Transactions on Image Processing","volume":"1 1","pages":""},"PeriodicalIF":10.6,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145545044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}