Panoramic driving perception requires robust and efficient context understanding, which in turn demands simultaneous semantic and instance segmentation. This paper proposes U-MobileViT, a lightweight backbone network designed to address this challenge. Our architecture combines the strengths of MobileViT, a family of Transformer-based models offering high accuracy at fast processing speeds, with the encoder-decoder segmentation structure of U-Net, facilitating multiscale feature fusion and accurate localization. U-MobileViT efficiently combines local and global spatial information through MobileViT blocks with Separable-Attention layers, yielding a computationally lightweight yet effective architecture, while the U-Net structure enables efficient integration of features across levels of the hierarchy. This synergistic combination produces rich, context-aware feature maps that are critical for accurate panoramic segmentation. Through extensive experiments on the challenging BDD100K driving dataset, we demonstrate that U-MobileViT achieves state-of-the-art performance in panoramic driving perception, outperforming existing lightweight models in both accuracy and inference speed. These results highlight the potential of U-MobileViT as a robust and efficient backbone for real-time panoramic scene understanding in autonomous driving applications. Code is available at https://github.com/quyongkeomut/UMobileViT.
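To make the two structural ingredients named above concrete, the sketch below is a minimal illustration, not the authors' implementation (see the repository for that): the class and parameter names (SeparableSelfAttention, MobileViTLikeBlock, TinyUMobileViT, base) are hypothetical, and the code simply follows the published pattern of MobileViTv2-style separable attention (a single global context vector, linear in the number of tokens) combined with a U-Net-style skip connection for multiscale fusion.

```python
# Hypothetical sketch: separable attention + U-shaped skip fusion.
# Not the U-MobileViT implementation; names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableSelfAttention(nn.Module):
    """Linear-complexity attention in the style of MobileViTv2:
    per-token scores build one global context vector, which is
    broadcast back to every token (O(N) instead of O(N^2))."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_score = nn.Linear(dim, 1)   # one context score per token
        self.to_key = nn.Linear(dim, dim)
        self.to_value = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, d)
        scores = self.to_score(x).softmax(dim=1)                       # (B, N, 1)
        context = (scores * self.to_key(x)).sum(dim=1, keepdim=True)  # (B, 1, d)
        return self.proj(F.relu(self.to_value(x)) * context)          # (B, N, d)

class MobileViTLikeBlock(nn.Module):
    """Local depthwise convolution followed by global token mixing,
    the local/global pattern the abstract attributes to MobileViT blocks."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.norm = nn.LayerNorm(channels)
        self.attn = SeparableSelfAttention(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        x = self.local(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        tokens = tokens + self.attn(self.norm(tokens))   # residual global mixing
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyUMobileViT(nn.Module):
    """Two-level U-shaped encoder-decoder: downsample, mix globally,
    upsample, then fuse the high-resolution skip feature, as in U-Net."""
    def __init__(self, in_ch: int = 3, base: int = 32, num_classes: int = 2):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, base, 3, padding=1)
        self.down = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)
        self.bottleneck = MobileViTLikeBlock(base * 2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.fuse = nn.Conv2d(base * 2, base, 3, padding=1)  # concat skip, then fuse
        self.head = nn.Conv2d(base, num_classes, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.stem(x)                         # high-resolution skip feature
        y = self.bottleneck(self.down(s))        # low-resolution, globally mixed
        y = self.up(y)
        y = self.fuse(torch.cat([y, s], dim=1))  # multiscale feature fusion
        return self.head(y)

if __name__ == "__main__":
    model = TinyUMobileViT()
    out = model(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 2, 64, 64])
```

The design choice this toy model mirrors is the one the abstract argues for: attention supplies global context cheaply at low resolution, while the U-Net skip restores the spatial detail needed for accurate localization.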