DREAM-PCD: Deep Reconstruction and Enhancement of mmWave Radar Pointcloud
Pub Date: 2024-12-11 | DOI: 10.1109/TIP.2024.3512356 | Vol. 33, pp. 6774-6789
Ruixu Geng;Yadong Li;Dongheng Zhang;Jincheng Wu;Yating Gao;Yang Hu;Yan Chen
Millimeter-wave (mmWave) radar pointclouds offer attractive potential for 3D sensing, thanks to their robustness in challenging conditions such as smoke and low illumination. However, existing methods fail to simultaneously address the three main challenges in mmWave radar pointcloud reconstruction: specular information loss, low angular resolution, and severe interference. In this paper, we propose DREAM-PCD, a novel framework specifically designed for real-time 3D environment sensing that combines signal processing and deep learning methods into three well-designed components to tackle all three challenges: Non-Coherent Accumulation for dense points, Synthetic Aperture Accumulation for improved angular resolution, and a Real-Denoise Multiframe network for interference removal. By leveraging causal multi-viewpoint accumulation and the “real-denoise” mechanism, DREAM-PCD significantly enhances generalization performance and real-time capability. We also introduce RadarEyes, the largest mmWave indoor dataset with over 1,000,000 frames, featuring a unique design that incorporates two orthogonal single-chip radars, a Lidar, and a camera, enriching dataset diversity and applications. Experimental results demonstrate that DREAM-PCD surpasses existing methods in reconstruction quality and exhibits superior generalization and real-time capabilities, enabling high-quality real-time reconstruction of radar pointclouds under various parameter settings and scenarios. We believe that DREAM-PCD, along with the RadarEyes dataset, will significantly advance mmWave radar perception in future real-world applications.
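The abstract describes the three components only at a high level. As a loose illustration of the accumulation idea (merging detections from multiple causal viewpoints into one denser cloud), the NumPy sketch below transforms per-frame point clouds into a shared world frame using known sensor poses and concatenates them. The function name, pose source, and toy data are assumptions for illustration; the actual DREAM-PCD pipeline (Non-Coherent Accumulation, Synthetic Aperture Accumulation, and the Real-Denoise network) is considerably more involved.

```python
# Illustrative sketch only: accumulating per-frame radar detections across
# viewpoints into one denser cloud. This mirrors the general idea of
# multi-view accumulation; it is NOT the paper's actual NCA/SAA/RDM pipeline.
import numpy as np

def accumulate_frames(frames, poses):
    """Merge per-frame point clouds into a common world frame.

    frames : list of (N_i, 3) arrays, points in each radar frame
    poses  : list of (4, 4) arrays, radar-to-world transforms (assumed known,
             e.g. from odometry)
    """
    merged = []
    for pts, T in zip(frames, poses):
        homo = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (N, 4) homogeneous
        merged.append((homo @ T.T)[:, :3])                   # transform to world
    return np.vstack(merged)

# Toy usage: two sparse frames observed from slightly shifted poses.
rng = np.random.default_rng(0)
f0 = rng.normal(size=(50, 3))
f1 = rng.normal(size=(60, 3))
T0 = np.eye(4)
T1 = np.eye(4); T1[:3, 3] = [0.1, 0.0, 0.0]   # 10 cm lateral shift
dense = accumulate_frames([f0, f1], [T0, T1])
print(dense.shape)  # (110, 3)
```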
Learning Frame-Event Fusion for Motion Deblurring
Pub Date: 2024-12-11 | DOI: 10.1109/TIP.2024.3512362 | Vol. 33, pp. 6836-6849
Wen Yang;Jinjian Wu;Jupo Ma;Leida Li;Weisheng Dong;Guangming Shi
Motion deblurring is a highly ill-posed problem due to the significant loss of motion information in the blurring process. Complementary informative features from auxiliary sensors such as event cameras can be exploited to guide motion deblurring, since an event camera captures rich motion information asynchronously with microsecond accuracy. In this paper, a novel frame-event fusion framework is proposed for event-driven motion deblurring (FEF-Deblur), which can fully exploit long-range cross-modal information interactions. First, different modalities are usually complementary but also redundant. Cross-modal fusion is therefore modeled as the separation and aggregation of complementary and unique features, avoiding modality redundancy. Unique features and complementary features are first inferred with parallel intra-modal self-attention and inter-modal cross-attention, respectively. After that, a correlation-based constraint is imposed between unique and complementary features to facilitate their differentiation, which assists in suppressing cross-modal redundancy. Additionally, spatio-temporal dependencies among neighboring inputs are crucial for motion deblurring. A recurrent cross-attention is introduced to preserve inter-input attention information, in which the current spatial features and the aggregated temporal features attend to each other through long-range interaction. Extensive experiments on both synthetic and real-world motion deblurring datasets demonstrate that our method outperforms state-of-the-art event-based and image/video-based methods.
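To make the separation-and-aggregation idea above more concrete, the following is a minimal PyTorch sketch, not the authors' code: unique features come from intra-modal self-attention, complementary features from inter-modal cross-attention, and a simple cosine-similarity penalty stands in for the correlation-based constraint. The module and function names, dimensions, and the additive aggregation are assumptions for illustration only.

```python
# Minimal sketch (assumed names/dims): unique features via intra-modal
# self-attention, complementary features via inter-modal cross-attention,
# plus a decorrelation penalty pushing the two apart.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparationBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, y):
        # x: tokens of one modality (B, N, C); y: tokens of the other modality
        unique, _ = self.self_attn(x, x, x)          # intra-modal
        complementary, _ = self.cross_attn(x, y, y)  # inter-modal
        return unique, complementary

def decorrelation_loss(u, c):
    # Penalize cosine similarity so unique and complementary parts diverge.
    return F.cosine_similarity(u, c, dim=-1).abs().mean()

frame_tok = torch.randn(2, 196, 64)   # e.g. flattened image features
event_tok = torch.randn(2, 196, 64)   # e.g. flattened event features
block = SeparationBlock()
u, c = block(frame_tok, event_tok)
loss = decorrelation_loss(u, c)
fused = u + c                         # one simple way to aggregate
print(u.shape, c.shape, loss.item())
```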
Portrait Shadow Removal Using Context-Aware Illumination Restoration Network
Pub Date: 2024-12-05 | DOI: 10.1109/TIP.2024.3497802 | Vol. 34, pp. 1-15
Jiangjian Yu;Ling Zhang;Qing Zhang;Qifei Zhang;Daiguo Zhou;Chao Liang;Chunxia Xiao
Portrait shadow removal is a challenging task due to the complex surface of the face. Although existing work in this field has made substantial progress, these methods tend to overlook information in the background areas. However, this background information not only contains important illumination cues but also plays a pivotal role in achieving lighting harmony between the face and the background after shadow elimination. In this paper, we propose a Context-aware Illumination Restoration Network (CIRNet) for portrait shadow removal. Our CIRNet consists of three stages. First, the Coarse Shadow Removal Network (CSRNet) mitigates the illumination discrepancies between shadow and non-shadow areas. Next, the Area-aware Shadow Restoration Network (ASRNet) predicts the illumination characteristics of shadowed areas by utilizing the background context and the non-shadow portrait context as references. Lastly, we introduce a Global Fusion Network to adaptively merge contextual information from different areas and generate the final shadow removal result. This approach leverages the illumination information from the background region while ensuring more consistent overall illumination in the generated images. Our approach can also be extended to high-resolution portrait shadow removal and portrait specular highlight removal. In addition, we construct the first real facial shadow dataset for portrait shadow removal, consisting of 6,200 pairs of facial images. Qualitative and quantitative comparisons demonstrate the advantages of our proposed dataset as well as our method.
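As a rough sketch of how a three-stage pipeline of this kind can be wired together (coarse removal, area-aware restoration with background and non-shadow context, then global fusion), the PyTorch skeleton below uses placeholder convolutional blocks. The class names, channel counts, and masking logic are assumptions for illustration and do not reproduce the published CSRNet/ASRNet/fusion architectures.

```python
# Skeletal three-stage pipeline in the spirit of CIRNet; sub-network bodies
# are placeholder conv blocks, not the published architecture.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class ThreeStageShadowRemoval(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.coarse = nn.Sequential(conv_block(4, ch), nn.Conv2d(ch, 3, 3, padding=1))
        self.area   = nn.Sequential(conv_block(7, ch), nn.Conv2d(ch, 3, 3, padding=1))
        self.fusion = nn.Sequential(conv_block(9, ch), nn.Conv2d(ch, 3, 3, padding=1))

    def forward(self, img, shadow_mask):
        # Stage 1: coarse removal narrows the shadow/non-shadow illumination gap.
        coarse = self.coarse(torch.cat([img, shadow_mask], dim=1))
        # Stage 2: restore shadowed areas using the unshadowed image as context.
        context = img * (1 - shadow_mask)          # background + lit portrait regions
        restored = self.area(torch.cat([coarse, context, shadow_mask], dim=1))
        # Stage 3: fuse coarse and restored predictions into the final output.
        return self.fusion(torch.cat([img, coarse, restored], dim=1))

net = ThreeStageShadowRemoval()
img = torch.randn(1, 3, 128, 128)
mask = torch.zeros(1, 1, 128, 128); mask[..., 40:90, 30:80] = 1.0
out = net(img, mask)
print(out.shape)  # torch.Size([1, 3, 128, 128])
```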
CrossEI: Boosting Motion-Oriented Object Tracking With an Event Camera
Pub Date: 2024-12-03 | DOI: 10.1109/TIP.2024.3505672 | Vol. 34, pp. 73-84
Zhiwen Chen;Jinjian Wu;Weisheng Dong;Leida Li;Guangming Shi
With their differential sensitivity and high temporal resolution, event cameras can record detailed motion cues, which complement frame-based cameras to enhance object tracking, especially in challenging dynamic scenes. However, how to better match heterogeneous event-image data and exploit the rich complementary cues they provide remains an open issue. In this paper, we align the event and image modalities by proposing a motion-adaptive event sampling method, and we revisit the cross-complementarity of event-image data to design a bidirectionally enhanced fusion framework. Specifically, the sampling strategy adapts to different dynamic scenes and integrates aligned event-image pairs. In addition, we design an image-guided motion estimation unit for extracting explicit instance-level motion, aiming to refine uncertain event cues and distinguish primary objects from the background. Then, a semantic modulation module is devised to utilize the enhanced object motion to modify the image features. Coupled with these two modules, the framework learns both the high motion sensitivity of events and the full texture of images to achieve more accurate and robust tracking. The proposed method is easily embedded in existing tracking pipelines and trained end-to-end. We evaluate it on four large benchmarks, i.e., FE108, VisEvent, FE240hz, and CoeSot. Extensive experiments demonstrate that our method achieves state-of-the-art performance, with large improvements attributable to our sampling strategy and fusion concept.
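The sketch below illustrates one plausible form of motion-adaptive event sampling: the accumulation window around each image timestamp shrinks when the local event rate (a crude proxy for motion) is high and grows when it is low, and the sampled events are rasterized into a count map aligned with the frame. The windowing rule, function names, and event format (t, x, y, p) are assumptions; CrossEI's actual sampling strategy is defined in the paper.

```python
# Illustrative motion-adaptive event sampling: window length adapts to the
# local event rate so fast motion is sampled with a shorter window.
import numpy as np

def adaptive_window(events, t_img, base_window=0.02, target_count=2000):
    """Select events around image timestamp t_img with a rate-adaptive window."""
    t = events[:, 0]
    # Estimate local rate from a fixed probe window around the image timestamp.
    probe = (t > t_img - base_window) & (t < t_img + base_window)
    rate = max(probe.sum(), 1) / (2 * base_window)   # events per second
    half = 0.5 * target_count / rate                 # shrink/grow the window
    sel = (t > t_img - half) & (t < t_img + half)
    return events[sel]

def to_count_image(events, H=128, W=128):
    """Rasterize sampled events into a per-pixel count map aligned to the frame."""
    img = np.zeros((H, W), dtype=np.float32)
    xs = events[:, 1].astype(int).clip(0, W - 1)
    ys = events[:, 2].astype(int).clip(0, H - 1)
    np.add.at(img, (ys, xs), 1.0)
    return img

# Toy stream: 50k random events over 1 second on a 128x128 sensor.
rng = np.random.default_rng(0)
ev = np.column_stack([rng.uniform(0, 1, 50000), rng.integers(0, 128, 50000),
                      rng.integers(0, 128, 50000), rng.choice([-1, 1], 50000)])
ev = ev[np.argsort(ev[:, 0])]
sampled = adaptive_window(ev, t_img=0.5)
count_map = to_count_image(sampled)
print(sampled.shape, count_map.sum())
```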
Pub Date: 2024-11-28 | DOI: 10.1109/TIP.2024.3504298
Min Li;Xiaoqin Zhang;Tangfei Liao;Sheng Lin;Guobao Xiao
Pyramid Temporal Hierarchy Network (PTH-Net) is a new paradigm for dynamic facial expression recognition, applied directly to raw videos without face detection and alignment. Unlike the traditional paradigm, which focuses only on facial areas and often overlooks valuable information such as body movements, PTH-Net preserves more critical information. It does this by distinguishing between backgrounds and human bodies at the feature level, offering greater flexibility as an end-to-end network. Specifically, PTH-Net utilizes a pre-trained backbone to extract multiple general video-understanding features at various temporal frequencies, forming a temporal feature pyramid. It then further expands this temporal hierarchy through differentiated parameter sharing and downsampling, ultimately refining the emotional information under the supervision of expression temporal-frequency invariance. Additionally, PTH-Net features an efficient Scalable Semantic Distinction layer that enhances feature discrimination, helping to better distinguish target expressions from non-target ones in the video. Finally, extensive experiments demonstrate that PTH-Net performs excellently on eight challenging benchmarks, with lower computational costs than previous methods. The source code is available at https://github.com/lm495455/PTH-Net
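To show what a temporal feature pyramid over pre-trained backbone features can look like in code, here is a minimal PyTorch sketch in which per-frame features are repeatedly downsampled along time and each level's summary feeds a shared classifier. The layer choices, dimensions, and class name are illustrative assumptions, not PTH-Net's published design (which additionally uses differentiated parameter sharing and the Scalable Semantic Distinction layer).

```python
# Rough temporal-pyramid sketch: per-frame features are halved in temporal
# resolution at each level; level summaries are concatenated and classified.
import torch
import torch.nn as nn

class TemporalPyramidClassifier(nn.Module):
    def __init__(self, dim=512, levels=3, num_classes=7):
        super().__init__()
        # One temporal conv per level, each halving the temporal resolution.
        self.stages = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(levels)
        )
        self.head = nn.Linear(dim * (levels + 1), num_classes)

    def forward(self, feats):
        # feats: (B, T, C) per-frame features from a frozen pre-trained backbone.
        x = feats.transpose(1, 2)                 # (B, C, T) for Conv1d
        summaries = [x.mean(dim=2)]               # level 0: full temporal rate
        for stage in self.stages:
            x = torch.relu(stage(x))              # coarser temporal level
            summaries.append(x.mean(dim=2))
        return self.head(torch.cat(summaries, dim=1))

model = TemporalPyramidClassifier()
clip_feats = torch.randn(4, 32, 512)              # 4 clips, 32 frames, 512-d features
logits = model(clip_feats)
print(logits.shape)                               # torch.Size([4, 7])
```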