Purpose: Estimating the 6 degrees of freedom (DoF) pose of an endoscope is crucial for various applications in minimally invasive computer-assisted surgery. Image-based approaches are some of the most practical solutions for pose estimation in surgical environments, due to a limited workspace and sensor constraints. However, these methods often struggle or fail in dynamic scenes, such as those involving tissue deformation, surgical tool movement, and tool-tissue interaction.
Methods: We propose DyEndoVO, an end-to-end visual odometry method for dynamic endoscopic scenes. Our method consists of a transformer-based motion detection network and a weighted pose-optimization module. The motion detection network infers scene dynamics and guides the pose estimation. Furthermore, we introduce a semi-synthetic dataset featuring tissue- and tool-movement categories. It serves as training data, improving pose estimation accuracy, and also includes motion masks to enable fine-grained inspection and evaluation.
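The core idea of the weighted pose-optimization module can be illustrated with a minimal sketch: per-pixel motion probabilities (as a motion detection network might predict) downweight residuals from dynamic regions so that tool and tissue motion does not corrupt the camera-pose estimate. The function names and the simple `1 - p` weighting below are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def weighted_pose_residual(residuals, motion_prob):
    """Downweight residuals from pixels predicted as dynamic.

    residuals:   (N,) photometric/geometric residuals per pixel
    motion_prob: (N,) predicted probability that each pixel is dynamic

    NOTE: the linear weighting used here is a hypothetical stand-in for
    the learned weighting described in the paper.
    """
    weights = 1.0 - motion_prob  # static pixels keep full weight
    return weights * residuals

def weighted_cost(residuals, motion_prob):
    """Least-squares cost over the downweighted residuals."""
    r = weighted_pose_residual(residuals, motion_prob)
    return 0.5 * float(np.sum(r ** 2))

# A large residual on a pixel flagged as dynamic (e.g. a moving tool)
# contributes little to the pose cost:
residuals = np.array([0.1, 0.2, 5.0])    # third pixel lies on a moving tool
motion_prob = np.array([0.0, 0.1, 0.95])
print(weighted_cost(residuals, motion_prob))
```

In a full pipeline this cost would be minimized over the 6-DoF pose parameters; the sketch only shows how scene-dynamics predictions enter the objective.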
Results: DyEndoVO significantly outperforms state-of-the-art methods in pose estimation for dynamic surgical scenes. Despite being trained solely on a synthetic dataset, our method generalizes well to real-world data without fine-tuning. Further analysis attributes this success to the effective detection of scene dynamics and the adaptive weighting learned for pose estimation; moreover, the semi-synthetic dataset plays a key role in bridging the sim-to-real gap.
Conclusions: In this work, we aim to improve the accuracy and robustness of pose estimation in challenging dynamic surgical scenes by effectively handling scene dynamics. Our method, combined with the proposed semi-synthetic dataset, demonstrates improved pose estimation performance and generalizes well to real-world data, showing its potential to advance related tasks such as SLAM and 3D reconstruction in complex surgical environments.
