V²-SfMLearner: Learning Monocular Depth and Ego-Motion for Multimodal Wireless Capsule Endoscopy

IEEE Transactions on Automation Science and Engineering (IF 6.4; Zone 2, Computer Science; JCR Q1, Automation & Control Systems) Pub Date: 2025-01-16 DOI: 10.1109/TASE.2025.3530791

Long Bai;Beilei Cui;Liangyu Wang;Yanheng Li;Shilong Yao;Sishen Yuan;Yanan Wu;Yang Zhang;Max Q.-H. Meng;Zhen Li;Weiping Ding;Hongliang Ren

Volume 22, pp. 11717-11730. Full text: https://ieeexplore.ieee.org/document/10843755/

Citations: 0

Abstract

Deep learning can predict depth maps and capsule ego-motion from capsule endoscopy videos, aiding 3D scene reconstruction and lesion localization. However, collisions of the capsule endoscope within the gastrointestinal tract introduce vibration perturbations into the training data. Existing solutions focus solely on vision-based processing and neglect auxiliary signals, such as vibrations, that could reduce noise and improve performance. Therefore, we propose V²-SfMLearner, a multimodal approach that integrates vibration signals into vision-based depth and capsule motion estimation for monocular capsule endoscopy. We construct a multimodal capsule endoscopy dataset containing vibration and visual signals, and develop an unsupervised method that uses vision-vibration signals to effectively eliminate vibration perturbations through multimodal learning. Specifically, we design a vibration network branch and a Fourier fusion module to detect and mitigate vibration noise. The fusion framework is compatible with popular vision-only algorithms. Extensive validation on the multimodal dataset demonstrates superior performance and robustness compared with vision-only algorithms. Because it requires no large external equipment, V²-SfMLearner has the potential to be integrated into clinical capsule robots, providing real-time and dependable digestive examination tools. The findings show promise for practical implementation in clinical settings, enhancing the diagnostic capabilities of doctors.

Note to Practitioners: This paper is motivated by the problem of estimating depth and ego-motion for wireless capsule endoscopy in the human gastrointestinal tract, with the goal of accurate, efficient, robust, and real-time inspection. Our estimation method does not require any external localization equipment. Instead, inspired by existing research on integrating capsule endoscopy with inertial measurement units, we introduce vibration signals into vision-based depth and ego-motion estimation, improving the accuracy and robustness of the estimates through multimodal learning. Research on capsule robots or computer vision can readily be combined with our framework for various clinical and industrial applications.
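The abstract names a vibration network branch and a Fourier fusion module but does not describe their internals, so the following is not the authors' method. It is only a minimal sketch, under stated assumptions, of the general idea that a frequency-domain view of a vibration signal can flag frames disturbed by collisions. The function names, the 200 Hz sample rate, the 20 Hz cutoff, and the per-frame weighting scheme are all hypothetical choices made for illustration.

```python
import numpy as np


def high_freq_energy_ratio(signal, sample_rate_hz, cutoff_hz):
    """Fraction of non-DC spectral energy at or above cutoff_hz (hypothetical helper)."""
    signal = np.asarray(signal, dtype=float)
    # Remove the mean so a constant offset does not count as signal energy.
    spectrum = np.fft.rfft(signal - signal.mean())
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate_hz)
    power = np.abs(spectrum) ** 2
    total = power.sum()
    if total == 0.0:
        return 0.0
    return float(power[freqs >= cutoff_hz].sum() / total)


def frame_weights(vibration, sample_rate_hz, samples_per_frame, cutoff_hz=20.0):
    """One weight in [0, 1] per video frame; lower means more high-frequency vibration.

    Assumption: the vibration stream is time-aligned with the video and each
    frame covers exactly samples_per_frame consecutive vibration samples.
    """
    n_frames = len(vibration) // samples_per_frame
    weights = np.empty(n_frames)
    for i in range(n_frames):
        window = vibration[i * samples_per_frame:(i + 1) * samples_per_frame]
        weights[i] = 1.0 - high_freq_energy_ratio(window, sample_rate_hz, cutoff_hz)
    return weights


# Synthetic example: two frames of 50 samples each at an assumed 200 Hz.
sample_rate_hz = 200.0
samples_per_frame = 50
t = np.arange(2 * samples_per_frame) / sample_rate_hz
smooth_motion = np.sin(2 * np.pi * 4.0 * t)           # slow 4 Hz motion
collision_burst = 2.0 * np.sin(2 * np.pi * 60.0 * t)  # 60 Hz vibration
collision_burst[:samples_per_frame] = 0.0             # burst only in frame 1

weights = frame_weights(smooth_motion + collision_burst,
                        sample_rate_hz, samples_per_frame)
print(weights)  # frame 1 receives a clearly lower weight than frame 0
```

In an unsupervised depth and ego-motion pipeline, weights of this kind could in principle down-weight the loss contribution of perturbed frames. A learned fusion module, as the abstract describes, would replace this hand-set cutoff with trained parameters.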

Source journal

IEEE Transactions on Automation Science and Engineering (Engineering and Technology: Automation & Control Systems)

CiteScore: 12.50

Self-citation rate: 14.30%

Articles published: 404

Review time: 3.0 months

Journal description: The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.