Feng Li, Zetao Huang, Lu Zhou, Haixia Peng, Yimin Chu
Biomedical Signal Processing and Control, vol. 100, Article 107127 (published 2024-11-13)
DOI: 10.1016/j.bspc.2024.107127
URL: https://www.sciencedirect.com/science/article/pii/S1746809424011856
Impact factor: 4.9; JCR: Q1 (Engineering, Biomedical)
Semi-supervised spatial-temporal calibration and semantic refinement network for video polyp segmentation
Automated video polyp segmentation (VPS) is vital for the early prevention and diagnosis of colorectal cancer (CRC). However, existing deep learning-based polyp segmentation methods mainly focus on independent static images: they neglect the spatial–temporal relationships among successive video frames, which limits their performance, and they require massive frame-by-frame annotation. To alleviate these challenges, we propose a novel semi-supervised spatial–temporal calibration and semantic refinement network (STCSR-Net) dedicated to VPS, which simultaneously considers inter-frame temporal consistency in video clips and intra-frame semantic-spatial information. It consists of a segmentation pathway and a propagation pathway, trained with a co-training scheme that supervises the predictions on un-annotated images in a semi-supervised fashion. Specifically, we propose an adaptive sequence calibration (ASC) block in the segmentation pathway and a dynamic transmission calibration (DTC) block in the propagation pathway to exploit temporal cues and keep predictions temporally consistent across consecutive frames. In both pathways, we introduce a residual block (RB) to suppress irrelevant noisy information and highlight the local boundary details of polyp lesions, and construct a multi-scale context extraction (MCE) module to enhance multi-scale high-level semantic feature expression. On that basis, we design a progressive adaptive context fusion (PACF) module that gradually aggregates multi-level features under the guidance of the reinforced high-level semantic information, eliminating the semantic gaps among them and improving the discriminative capacity of the features for polyp targets. Through the combination of the RB, MCE and PACF modules, semantic-spatial correlations on polyp lesions within each frame can be established.
Coupled with a context-free loss, our model merges the feature representations of neighboring frames to reduce its dependency on the varying contexts of consecutive frames and strengthen its robustness. Extensive experiments show that our model, trained with a 100% annotation ratio, achieves state-of-the-art performance on challenging datasets. Even when trained with a 50% annotation ratio, our model significantly outperforms existing state-of-the-art image-based and video-based polyp segmentation models on the newly built local TRPolyp dataset, with gains of at least 1.3% and 0.9% in mDice and mIoU, while performing comparably to the top fully supervised competitors on the publicly available CVC-612, CVC-300 and ASU-Mayo-Clinic benchmarks. Notably, our model performs exceptionally well on videos containing complex scenarios such as motion blur and occlusion. In an endoscopist-machine competition, it also achieves approximately 0.794 mDice and 0.707 mIoU at an inference time of 0.036 s per frame, outperforming junior and senior endoscopists and nearly matching expert ones. The strong capability of the proposed STCSR-Net holds promise for improving the quality of VPS and underscores the model's adaptability and potential in real-world clinical scenarios.
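The abstract reports results as mDice and mIoU but does not define them. The sketch below uses the standard definitions (Dice = 2|P∩T| / (|P| + |T|), IoU = |P∩T| / |P∪T|) and assumes the "m" prefix means a plain average over frames; the paper's exact averaging protocol may differ. The function names and the toy masks are hypothetical and are not taken from the authors' code. A small helper also converts the reported 0.036 s per frame into frames per second.

```python
from typing import List, Sequence, Tuple

# Binary mask as nested sequences: 1 = polyp pixel, 0 = background.
Mask = Sequence[Sequence[int]]


def dice_and_iou(pred: Mask, target: Mask, eps: float = 1e-8) -> Tuple[float, float]:
    """Return (Dice, IoU) for one pair of binary masks of equal shape.

    eps avoids division by zero; two empty masks score 1.0 on both metrics.
    """
    inter = sum(p & t for p_row, t_row in zip(pred, target)
                for p, t in zip(p_row, t_row))
    pred_area = sum(sum(row) for row in pred)
    target_area = sum(sum(row) for row in target)
    union = pred_area + target_area - inter
    dice = (2 * inter + eps) / (pred_area + target_area + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou


def mean_scores(preds: List[Mask], targets: List[Mask]) -> Tuple[float, float]:
    """Average per-frame Dice and IoU over a clip (assumed meaning of mDice/mIoU)."""
    scores = [dice_and_iou(p, t) for p, t in zip(preds, targets)]
    n = len(scores)
    return sum(d for d, _ in scores) / n, sum(i for _, i in scores) / n


def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame inference time into throughput."""
    return 1.0 / seconds_per_frame


# Toy two-frame clip (made-up masks, for illustration only).
demo_preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
demo_targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
demo_dice, demo_iou = mean_scores(demo_preds, demo_targets)
print(round(demo_dice, 3), round(demo_iou, 3))  # 0.833 0.75
print(round(frames_per_second(0.036), 1))       # 27.8
```

Under these definitions, Dice is never lower than IoU for the same prediction, which is consistent with the reported pair of 0.794 mDice and 0.707 mIoU. The reported 0.036 s per frame corresponds to roughly 27.8 frames per second.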
About the journal:
Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with the practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management.
Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.