MaskFusionNet: A Dual-Stream Fusion Model With Masked Pre-Training Mechanism for rPPG Measurement

IF 11.1 | Zone 1, Engineering & Technology | Q1, ENGINEERING, ELECTRICAL & ELECTRONIC | IEEE Transactions on Circuits and Systems for Video Technology | Pub Date: 2024-07-03 | DOI: 10.1109/TCSVT.2024.3422849

Yizhu Zhang;Jingang Shi;Jiayin Wang;Yuan Zong;Wenming Zheng;Guoying Zhao

IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11521-11534. Journal Article. https://ieeexplore.ieee.org/document/10583917/

Citations: 0

Abstract

Remote photoplethysmography (rPPG) has considerable significance in areas such as disease diagnosis and emotion analysis. Recent rPPG models have demonstrated excellent performance owing to their powerful heart rate information extraction capabilities. However, these models often focus on limited regions of interest (ROIs) in the facial image, which makes them sensitive to interference. If the ROI is affected by muscle movement, lighting variation, or noise, the model's performance degrades significantly. To address this limitation, we propose a two-stage model called MaskFusionNet. 1) In the pre-training stage, a mask-reconstruction mechanism with a tube masking strategy drives MaskFusionNet to learn rPPG information from various facial regions, which enhances the model's ability to resist interference. Based on the periodicity and continuity of the heart rate signal, we also design a novel spatio-temporal reconstruction loss function that focuses on the data's spatial features and temporal continuity. 2) In the fine-tuning stage, we propose the Multi-Scale Fusion Block (MFB) to combine multi-scale features from the dual-stream network. The MFB allows the model to detect subtle heart rate variations in adjacent frames while minimizing the impact of interference by extracting features over longer segments. The transformer-based MaskFusionNet can extract multi-scale fused heart rate features from a wide range of skin regions while preserving the ability to model long-range sequence information. To validate its effectiveness, we extensively evaluate our model on three benchmark datasets (VIPL-HR, COHFACE, and PURE), demonstrating its superior performance in both intra-dataset and cross-dataset testing scenarios.
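
The tube masking strategy mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the patch grid, mask ratio, or implementation details, so this follows the common definition of tube masking from video masked pre-training: one random set of spatial patches is hidden, and the same set is hidden in every frame. The function name `tube_mask` and all parameter values are hypothetical and are not taken from the paper.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a [num_frames][grid_h * grid_w] boolean mask (True = masked).

    Hypothetical helper: the same spatial patches are masked in every
    frame, so each masked location forms a "tube" along the time axis.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))
    rng = random.Random(seed)
    # Choose the masked spatial positions once, for the whole clip.
    masked = set(rng.sample(range(num_patches), num_masked))
    frame_mask = [p in masked for p in range(num_patches)]
    # Repeat the identical spatial mask for every frame.
    return [list(frame_mask) for _ in range(num_frames)]


# Assumed example values: 8 frames, a 4x4 patch grid, 75% of patches masked.
mask = tube_mask(num_frames=8, grid_h=4, grid_w=4, mask_ratio=0.75, seed=0)
visible_per_frame = [row.count(False) for row in mask]
```

Because a masked patch stays hidden for the whole clip, a model trained this way cannot recover it by copying the same location from a neighboring frame and must rely on the visible regions instead, which is consistent with the abstract's goal of learning rPPG information from various facial regions.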




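Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months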

About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.
