Background
Lack of depth perception from medical imaging systems is one of the long-standing technological limitations of minimally invasive surgeries. The ability to visualize anatomical structures in 3D can benefit conventional arthroscopic surgeries, as a full 3D semantic representation of the surgical site directly supports the surgeon's work. It also opens the possibility of intraoperative image registration with preoperative clinical records for the development of semi-autonomous and fully autonomous platforms. This study aimed to present a novel monocular depth prediction model that infers depth maps from a single color arthroscopic video frame.
Methods
We applied a novel technique that combines supervised and self-supervised loss terms, thereby mitigating the drawbacks of each approach. It enabled the estimation of edge-preserving depth maps from a single untextured arthroscopic frame. The proposed image acquisition technique projected artificial textures onto the tissue surface to improve the quality of disparity maps obtained from stereo images. Moreover, by integrating attention-aware multi-scale feature extraction with global scene contextual constraints and multi-scale depth fusion, the model could predict reliable and accurate tissue depth maps of the surgical site that comply with the scene geometry.
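To illustrate how a supervised disparity term and self-supervised terms can be combined, the following is a minimal sketch assuming PyTorch-style tensors, a right image already warped into the left view, and illustrative loss weights; it is not the authors' exact formulation (a structural dissimilarity term, sketched after the Results, would typically be added as well).

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_disp, coarse_disp, left_img, right_warped,
                  w_sup=1.0, w_photo=1.0, w_smooth=0.1):
    """Hypothetical weighted sum of a supervised term (against the coarse
    stereo-matching disparity) and self-supervised re-projection and
    edge-aware gradient terms. Weights are illustrative only."""
    # Supervised term: L1 distance to the coarse disparity label.
    sup = F.l1_loss(pred_disp, coarse_disp)

    # Self-supervised re-projection term: photometric error between the
    # left image and the right image warped into the left view.
    photo = F.l1_loss(right_warped, left_img)

    # Edge-aware smoothness: penalise disparity gradients, down-weighted
    # where the image itself has strong gradients (edges).
    d_dx = (pred_disp[:, :, :, 1:] - pred_disp[:, :, :, :-1]).abs()
    d_dy = (pred_disp[:, :, 1:, :] - pred_disp[:, :, :-1, :]).abs()
    i_dx = (left_img[:, :, :, 1:] - left_img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    i_dy = (left_img[:, :, 1:, :] - left_img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    smooth = (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

    return w_sup * sup + w_photo * photo + w_smooth * smooth
```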
Results
A total of 4,128 stereo frames from a knee phantom were used to train the network; during the pre-training stage, the network learned disparity maps from the stereo images. The fine-tuning phase used 12,695 knee arthroscopic stereo frames from cadaver experiments along with their corresponding coarse disparity maps obtained from the stereo matching technique. In a supervised fashion, the network learns the transformation from the left image to the disparity map, whereas the self-supervised loss term refines the coarse depth map by minimizing re-projection, gradient, and structural dissimilarity losses. Together, our method produces high-quality 3D maps with minimal re-projection losses of 0.0004132 (structural similarity index), 0.00036120156 (L1 error distance), and 6.591908 × 10⁻⁵ (L1 gradient error distance).
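As a rough illustration of the three reported quantities, the sketch below shows common formulations of a structural dissimilarity (SSIM-based) term, a mean absolute (L1) error, and an L1 distance between image gradients, assuming PyTorch tensors of shape (B, C, H, W); the function names, pooling window, and constants are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def structural_dissimilarity(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean structural dissimilarity, (1 - SSIM) / 2, using 3x3 average
    pooling for the local statistics (a common choice)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return torch.clamp((1 - ssim) / 2, 0, 1).mean()

def l1_error(x, y):
    """Mean absolute error between prediction and reference."""
    return (x - y).abs().mean()

def l1_gradient_error(x, y):
    """L1 distance between horizontal and vertical image gradients."""
    dx = lambda t: t[:, :, :, 1:] - t[:, :, :, :-1]
    dy = lambda t: t[:, :, 1:, :] - t[:, :, :-1, :]
    return (dx(x) - dx(y)).abs().mean() + (dy(x) - dy(y)).abs().mean()
```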
Conclusion
Machine learning techniques for monocular depth prediction were studied to infer accurate depth maps from a single color arthroscopic video frame. Moreover, the study integrates a segmentation model; hence, 3D segmented maps are inferred, providing extended perception ability and tissue awareness.