Title: MaskFusionNet: A Dual-Stream Fusion Model With Masked Pre-Training Mechanism for rPPG Measurement
Authors: Yizhu Zhang; Jingang Shi; Jiayin Wang; Yuan Zong; Wenming Zheng; Guoying Zhao
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11521-11534
Publication date: 2024-07-03
Publication type: Journal Article
DOI: 10.1109/TCSVT.2024.3422849
URL: https://ieeexplore.ieee.org/document/10583917/
Impact factor: 11.1 (JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC)
Region: 1 (Engineering and Technology)
Open access: No
Citations: 0
Abstract
Remote photoplethysmography (rPPG) has considerable significance in areas such as disease diagnosis and emotion analysis. Recent rPPG models have demonstrated excellent performance due to their powerful heart rate information extraction capabilities. However, these models often focus on limited regions of interest (ROIs) in the facial image, which makes them sensitive to interference: if the ROI is affected by muscle movement, lighting variation, or noise, the model’s performance degrades significantly. To address this limitation, we propose a two-stage model called MaskFusionNet. 1) In the pre-training stage, a mask-reconstruction mechanism with a tube masking strategy drives MaskFusionNet to learn rPPG information from various facial regions, which enhances the model’s ability to resist interference. Based on the periodicity and continuity of the heart rate signal, we also design a novel spatio-temporal reconstruction loss function that focuses on the data’s spatial features and temporal continuity. 2) In the fine-tuning stage, we propose the Multi-Scale Fusion Block (MFB) to combine multi-scale features from the dual-stream network. The MFB allows the model to detect subtle heart rate variations in adjacent frames while minimizing the impact of interference by extracting features over longer segments. The transformer-based MaskFusionNet can thus extract multi-scale fused heart rate features from a wide range of skin regions while preserving its ability to model long-range sequence information. To validate its effectiveness, we extensively evaluate our model on three benchmark datasets (VIPL-HR, COHFACE, and PURE), demonstrating its superior performance in both intra-dataset and cross-dataset testing scenarios.
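The tube masking strategy mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual meaning of the term in masked video pre-training: one random spatial mask is sampled over the patch grid and reused for every frame, so each masked position forms a "tube" through time. The abstract does not give the paper's patch grid, clip length, or masking ratio, so the function name `tube_mask` and all values below are hypothetical and chosen only for illustration.

```python
import random


def tube_mask(num_frames, grid_h, grid_w, mask_ratio, seed=None):
    """Return a [num_frames][grid_h * grid_w] boolean mask (True = masked).

    Hypothetical helper, not the authors' implementation: a single spatial
    pattern is sampled once and repeated for every frame.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_masked = round(num_patches * mask_ratio)
    # Sample which patch positions are hidden (without replacement).
    masked_positions = set(rng.sample(range(num_patches), num_masked))
    spatial = [i in masked_positions for i in range(num_patches)]
    # Reuse the same spatial pattern for all frames to form tubes.
    return [list(spatial) for _ in range(num_frames)]


# Illustrative values only: 8 frames, a 4x4 patch grid, 75% of patches masked.
mask = tube_mask(num_frames=8, grid_h=4, grid_w=4, mask_ratio=0.75, seed=0)
```

Because a masked patch stays hidden in every frame, a model trained this way cannot recover it by copying the same location from a neighbouring frame and has to rely on the remaining visible facial regions, which is consistent with the abstract's stated goal of learning rPPG information from various facial regions.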
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.