Triple-Supervised Convolutional Transformer Aggregation for Robust Monocular Endoscopic Dense Depth Estimation

Wenkang Fan; Wenjing Jiang; Hong Shi; Hui-Qing Zeng; Yinran Chen; Xiongbiao Luo

IEEE Transactions on Medical Robotics and Bionics, published 2024-03-31. DOI: 10.1109/TMRB.2024.3407384. https://ieeexplore.ieee.org/document/10545340/
Accurate, deeply learned dense depth prediction remains a challenge for monocular vision reconstruction, and endoscopic dense depth prediction is even more challenging than monocular depth estimation from natural images. Not only is it difficult to annotate endoscopic video data for supervised learning, but endoscopic images also suffer from illumination variations (a limited light source, a limited field of view, and specular highlights) and from smooth, textureless surfaces in complex surgical fields. This work explores a new deep learning framework, triple-supervised convolutional transformer aggregation (TSCTA), for monocular endoscopic dense depth recovery that requires no annotated data. Specifically, TSCTA builds convolutional transformer aggregation networks with a new hybrid encoder that combines dense convolutions and scalable transformers to extract local texture features and global spatial-temporal features in parallel, together with a local-global aggregation decoder that effectively fuses global and local features from coarse to fine. Moreover, we develop a self-supervised learning framework with triple supervision, which integrates minimum photometric consistency and depth consistency with sparse-depth self-supervision to train our model on unannotated data. We evaluated TSCTA on unannotated monocular endoscopic images collected from various surgical procedures; the experimental results show that our method achieves a more accurate depth range, a more complete depth distribution, richer textures, and better qualitative and quantitative assessment results than state-of-the-art deeply learned monocular dense depth estimation methods.
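The abstract describes a hybrid encoder whose two branches extract local texture (convolution) and global context (self-attention) in parallel. The paper's actual layer definitions are not given here; the following minimal numpy sketch (all function names are hypothetical, not from the paper) only illustrates why the two operations are complementary: a small convolution kernel sees a 3x3 neighborhood, while a single self-attention step lets every position mix information from the whole image.

```python
import numpy as np

def local_conv(x, k):
    # 3x3 valid convolution: each output pixel depends only on its
    # immediate 3x3 neighborhood, capturing local texture.
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

def global_attention(tokens):
    # Single-head self-attention over all tokens (pixels flattened to rows):
    # every output row is a weighted mix of ALL input rows, i.e. global context.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # rows sum to 1
    return attn @ tokens
```

In an encoder like the one described, the outputs of the two branches would then be fused by the local-global aggregation decoder from coarse to fine; how that fusion is implemented is specified in the full paper, not in this sketch.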
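The triple supervision combines three self-supervised signals: photometric consistency, depth consistency, and sparse-depth supervision. The abstract gives no loss formulas, so the sketch below is only an illustrative composition under common self-supervised depth-estimation conventions; the function names, the simple L1 forms, and the weights `w_photo`, `w_depth`, `w_sparse` are all assumptions, not the paper's formulation.

```python
import numpy as np

def photometric_loss(target, warped):
    # Photometric consistency: compare the target frame against a source
    # frame warped via predicted depth and camera motion (L1 stand-in here).
    return float(np.mean(np.abs(target - warped)))

def depth_consistency_loss(depth_a, depth_b_warped):
    # Depth consistency: depths predicted for adjacent frames should agree
    # once one is warped into the other's view.
    return float(np.mean(np.abs(depth_a - depth_b_warped)))

def sparse_depth_loss(pred_depth, sparse_depth, mask):
    # Sparse-depth self-supervision: penalize error only where sparse
    # reconstructed points (mask == 1) provide a depth value.
    denom = max(float(mask.sum()), 1.0)
    return float(np.sum(np.abs(pred_depth - sparse_depth) * mask) / denom)

def triple_supervised_loss(target, warped, depth_a, depth_b_warped,
                           pred_depth, sparse_depth, mask,
                           w_photo=1.0, w_depth=0.5, w_sparse=0.5):
    # Weighted sum of the three supervision signals; weights are illustrative.
    return (w_photo * photometric_loss(target, warped)
            + w_depth * depth_consistency_loss(depth_a, depth_b_warped)
            + w_sparse * sparse_depth_loss(pred_depth, sparse_depth, mask))
```

In practice the photometric term in such frameworks is typically a per-pixel minimum over several source frames (hence "minimum photometric consistency") and often mixes L1 with a structural-similarity term; the scalar L1 here is only a placeholder for the composition pattern.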