Triple-Supervised Convolutional Transformer Aggregation for Robust Monocular Endoscopic Dense Depth Estimation

IF 3.4 · Q2 (ENGINEERING, BIOMEDICAL) · IEEE Transactions on Medical Robotics and Bionics · Pub Date: 2024-03-31 · DOI: 10.1109/TMRB.2024.3407384
Wenkang Fan;Wenjing Jiang;Hong Shi;Hui-Qing Zeng;Yinran Chen;Xiongbiao Luo
{"title":"三重监督卷积变换器聚合用于稳健的单目内窥镜密集深度估计","authors":"Wenkang Fan;Wenjing Jiang;Hong Shi;Hui-Qing Zeng;Yinran Chen;Xiongbiao Luo","doi":"10.1109/TMRB.2024.3407384","DOIUrl":null,"url":null,"abstract":"Accurate deeply learned dense depth prediction remains a challenge to monocular vision reconstruction. Compared to monocular depth estimation from natural images, endoscopic dense depth prediction is even more challenging. While it is difficult to annotate endoscopic video data for supervised learning, endoscopic video images certainly suffer from illumination variations (limited lighting source, limited field of viewing, and specular highlight), smooth and textureless surfaces in surgical complex fields. This work explores a new deep learning framework of triple-supervised convolutional transformer aggregation (TSCTA) for monocular endoscopic dense depth recovery without annotating any data. Specifically, TSCTA creates convolutional transformer aggregation networks with a new hybrid encoder that combines dense convolution and scalable transformers to parallel extract local texture features and global spatial-temporal features, while it builds a local and global aggregation decoder to effectively aggregate global features and local features from coarse to fine. Moreover, we develop a self-supervised learning framework with triple supervision, which integrates minimum photometric consistency and depth consistency with sparse depth self-supervision to train our model by unannotated data. We evaluated TSCTA on unannotated monocular endoscopic images collected from various surgical procedures, with the experimental results showing that our methods can achieve more accurate depth range, more complete depth distribution, more sufficient textures, better qualitative and quantitative assessment results than state-of-the-art deeply learned monocular dense depth estimation methods.","PeriodicalId":73318,"journal":{"name":"IEEE transactions on medical robotics and bionics","volume":null,"pages":null},"PeriodicalIF":3.4000,"publicationDate":"2024-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Triple-Supervised Convolutional Transformer Aggregation for Robust Monocular Endoscopic Dense Depth Estimation\",\"authors\":\"Wenkang Fan;Wenjing Jiang;Hong Shi;Hui-Qing Zeng;Yinran Chen;Xiongbiao Luo\",\"doi\":\"10.1109/TMRB.2024.3407384\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Accurate deeply learned dense depth prediction remains a challenge to monocular vision reconstruction. Compared to monocular depth estimation from natural images, endoscopic dense depth prediction is even more challenging. While it is difficult to annotate endoscopic video data for supervised learning, endoscopic video images certainly suffer from illumination variations (limited lighting source, limited field of viewing, and specular highlight), smooth and textureless surfaces in surgical complex fields. This work explores a new deep learning framework of triple-supervised convolutional transformer aggregation (TSCTA) for monocular endoscopic dense depth recovery without annotating any data. 
Specifically, TSCTA creates convolutional transformer aggregation networks with a new hybrid encoder that combines dense convolution and scalable transformers to parallel extract local texture features and global spatial-temporal features, while it builds a local and global aggregation decoder to effectively aggregate global features and local features from coarse to fine. Moreover, we develop a self-supervised learning framework with triple supervision, which integrates minimum photometric consistency and depth consistency with sparse depth self-supervision to train our model by unannotated data. We evaluated TSCTA on unannotated monocular endoscopic images collected from various surgical procedures, with the experimental results showing that our methods can achieve more accurate depth range, more complete depth distribution, more sufficient textures, better qualitative and quantitative assessment results than state-of-the-art deeply learned monocular dense depth estimation methods.\",\"PeriodicalId\":73318,\"journal\":{\"name\":\"IEEE transactions on medical robotics and bionics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2024-03-31\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on medical robotics and bionics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10545340/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, BIOMEDICAL\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical robotics and bionics","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10545340/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
Citations: 0

Abstract

Accurate, deeply learned dense depth prediction remains a challenge for monocular vision reconstruction, and endoscopic dense depth prediction is even more challenging than monocular depth estimation from natural images. Not only is it difficult to annotate endoscopic video data for supervised learning, but endoscopic video images also suffer from illumination variations (a limited lighting source, a limited field of view, and specular highlights) and from smooth, textureless surfaces in complex surgical fields. This work explores a new deep learning framework, triple-supervised convolutional transformer aggregation (TSCTA), for monocular endoscopic dense depth recovery without annotating any data. Specifically, TSCTA creates convolutional transformer aggregation networks with a new hybrid encoder that combines dense convolution and scalable transformers to extract local texture features and global spatial-temporal features in parallel, and it builds a local and global aggregation decoder to effectively aggregate global and local features from coarse to fine. Moreover, we develop a self-supervised learning framework with triple supervision, which integrates minimum photometric consistency and depth consistency with sparse depth self-supervision to train our model on unannotated data. We evaluated TSCTA on unannotated monocular endoscopic images collected from various surgical procedures; the experimental results show that our method achieves a more accurate depth range, a more complete depth distribution, more sufficient textures, and better qualitative and quantitative assessment results than state-of-the-art deeply learned monocular dense depth estimation methods.
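The abstract describes a hybrid encoder that extracts local texture features with dense convolutions and global spatial-temporal features with transformers in parallel, followed by a decoder that aggregates the two feature streams from coarse to fine. The minimal PyTorch sketch below illustrates that general pattern only; the module names, channel sizes, and fusion-by-addition choice are illustrative assumptions, not the authors' TSCTA architecture.

```python
# Minimal sketch: parallel CNN + transformer encoder with a coarse-to-fine
# aggregation decoder. Sizes and fusion choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridBlock(nn.Module):
    """One encoder stage: dense convolution for local texture, a transformer
    layer for global context, fused by simple addition."""
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.attn = nn.TransformerEncoderLayer(
            d_model=out_ch, nhead=heads, dim_feedforward=2 * out_ch,
            batch_first=True)

    def forward(self, x):
        local = self.conv(x)                         # B x C x H x W
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)    # B x HW x C
        global_ = self.attn(tokens).transpose(1, 2).reshape(b, c, h, w)
        return local + global_                       # fuse local and global cues

class CoarseToFineDecoder(nn.Module):
    """Upsample the coarsest feature and aggregate skip features stage by stage."""
    def __init__(self, chs=(32, 64, 128)):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(ci, co, 1) for ci, co in zip(chs[::-1][:-1], chs[::-1][1:])])
        self.fuse = nn.ModuleList(
            [nn.Conv2d(c * 2, c, 3, padding=1) for c in chs[:-1][::-1]])
        self.head = nn.Conv2d(chs[0], 1, 3, padding=1)

    def forward(self, feats):                        # feats ordered fine -> coarse
        x = feats[-1]
        for skip, red, fuse in zip(feats[-2::-1], self.reduce, self.fuse):
            x = F.interpolate(red(x), size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = F.relu(fuse(torch.cat([x, skip], dim=1)))
        return torch.sigmoid(self.head(x))           # dense depth map in (0, 1)

class HybridDepthNet(nn.Module):
    def __init__(self, chs=(32, 64, 128)):
        super().__init__()
        ins = (3,) + chs[:-1]
        self.stages = nn.ModuleList([HybridBlock(i, o) for i, o in zip(ins, chs)])
        self.decoder = CoarseToFineDecoder(chs)

    def forward(self, img):
        feats, x = [], img
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return self.decoder(feats)

depth = HybridDepthNet()(torch.randn(1, 3, 64, 80))  # 1 x 1 x 32 x 40 depth map
```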
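The triple supervision combines minimum photometric consistency, depth consistency, and sparse depth self-supervision. The sketch below shows one plausible way to compose such a loss in PyTorch; the per-term definitions, the source of the sparse depth (e.g., SfM/SLAM points), and the weights are assumptions, since the abstract does not give the exact formulation.

```python
# Illustrative composition of a triple-supervised loss: per-pixel minimum
# photometric error over warped source frames, depth consistency between
# frames, and a sparse depth term. Weights and terms are assumptions.
import torch

def photometric_error(target, warped):
    """Mean absolute color difference per pixel (L1 photometric error)."""
    return (target - warped).abs().mean(dim=1, keepdim=True)  # B x 1 x H x W

def triple_supervised_loss(target, warped_sources, pred_depth,
                           warped_depths, sparse_depth, sparse_mask,
                           w_photo=1.0, w_depth=0.5, w_sparse=0.2):
    # 1) Minimum photometric consistency: keep, per pixel, the smallest error
    #    over all warped source frames, which suppresses occluded regions.
    errs = torch.stack([photometric_error(target, w) for w in warped_sources])
    photo_loss = errs.min(dim=0).values.mean()

    # 2) Depth consistency: the predicted depth should agree with depths
    #    reprojected (warped) from neighbouring frames.
    depth_loss = torch.stack(
        [(pred_depth - d).abs().mean() for d in warped_depths]).mean()

    # 3) Sparse depth self-supervision: match the prediction to sparse
    #    reconstructed points where they exist (mask is 1 at valid pixels).
    valid = sparse_mask.sum().clamp(min=1.0)
    sparse_loss = ((pred_depth - sparse_depth).abs() * sparse_mask).sum() / valid

    return w_photo * photo_loss + w_depth * depth_loss + w_sparse * sparse_loss

# Toy shapes only; real inputs come from view-synthesis / warping modules.
B, H, W = 2, 32, 40
loss = triple_supervised_loss(
    target=torch.rand(B, 3, H, W),
    warped_sources=[torch.rand(B, 3, H, W) for _ in range(2)],
    pred_depth=torch.rand(B, 1, H, W),
    warped_depths=[torch.rand(B, 1, H, W) for _ in range(2)],
    sparse_depth=torch.rand(B, 1, H, W),
    sparse_mask=(torch.rand(B, 1, H, W) > 0.9).float())
```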