{"title":"用于自动驾驶汽车分割和单目深度估计的多任务视觉转换器","authors":"Durga Prasad Bavirisetti;Herman Ryen Martinsen;Gabriel Hanssen Kiss;Frank Lindseth","doi":"10.1109/OJITS.2023.3335648","DOIUrl":null,"url":null,"abstract":"In this paper, we investigate the use of Vision Transformers for processing and understanding visual data in an autonomous driving setting. Specifically, we explore the use of Vision Transformers for semantic segmentation and monocular depth estimation using only a single image as input. We present state-of-the-art Vision Transformers for these tasks and combine them into a multitask model. Through multiple experiments on four different street image datasets, we demonstrate that the multitask approach significantly reduces inference time while maintaining high accuracy for both tasks. Additionally, we show that changing the size of the Transformer-based backbone can be used as a trade-off between inference speed and accuracy. Furthermore, we investigate the use of synthetic data for pre-training and show that it effectively increases the accuracy of the model when real-world data is limited.","PeriodicalId":100631,"journal":{"name":"IEEE Open Journal of Intelligent Transportation Systems","volume":"4 ","pages":"909-928"},"PeriodicalIF":4.6000,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10330677","citationCount":"0","resultStr":"{\"title\":\"A Multi-Task Vision Transformer for Segmentation and Monocular Depth Estimation for Autonomous Vehicles\",\"authors\":\"Durga Prasad Bavirisetti;Herman Ryen Martinsen;Gabriel Hanssen Kiss;Frank Lindseth\",\"doi\":\"10.1109/OJITS.2023.3335648\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we investigate the use of Vision Transformers for processing and understanding visual data in an autonomous driving setting. Specifically, we explore the use of Vision Transformers for semantic segmentation and monocular depth estimation using only a single image as input. We present state-of-the-art Vision Transformers for these tasks and combine them into a multitask model. Through multiple experiments on four different street image datasets, we demonstrate that the multitask approach significantly reduces inference time while maintaining high accuracy for both tasks. Additionally, we show that changing the size of the Transformer-based backbone can be used as a trade-off between inference speed and accuracy. 
Furthermore, we investigate the use of synthetic data for pre-training and show that it effectively increases the accuracy of the model when real-world data is limited.\",\"PeriodicalId\":100631,\"journal\":{\"name\":\"IEEE Open Journal of Intelligent Transportation Systems\",\"volume\":\"4 \",\"pages\":\"909-928\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2023-11-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10330677\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Open Journal of Intelligent Transportation Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10330677/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of Intelligent Transportation Systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10330677/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
A Multi-Task Vision Transformer for Segmentation and Monocular Depth Estimation for Autonomous Vehicles
Abstract: In this paper, we investigate the use of Vision Transformers for processing and understanding visual data in an autonomous driving setting. Specifically, we explore the use of Vision Transformers for semantic segmentation and monocular depth estimation using only a single image as input. We present state-of-the-art Vision Transformers for these tasks and combine them into a multitask model. Through multiple experiments on four different street-image datasets, we demonstrate that the multitask approach significantly reduces inference time while maintaining high accuracy on both tasks. Additionally, we show that varying the size of the Transformer-based backbone offers a trade-off between inference speed and accuracy. Furthermore, we investigate the use of synthetic data for pre-training and show that it effectively increases the accuracy of the model when real-world data is limited.
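To make the multitask idea in the abstract concrete, the following is a minimal, hypothetical sketch of a shared Transformer backbone feeding two task-specific heads, one for semantic segmentation and one for monocular depth. This is not the authors' implementation: the class name, dimensions, backbone configuration, head design, and the 19-class label count (a Cityscapes-style assumption) are all illustrative. The key point it demonstrates is that one forward pass through the backbone serves both tasks, which is where the reported inference-time savings come from.

```python
# Hypothetical multitask ViT sketch (PyTorch), not the paper's model.
import torch
import torch.nn as nn

class MultiTaskViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, n_classes=19):
        super().__init__()
        self.grid = img_size // patch           # patch grid side length
        n_patches = self.grid ** 2
        # Patch embedding: a strided conv turns the image into a token sequence.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        # Two lightweight heads share the same backbone features, so a single
        # forward pass produces both outputs.
        self.seg_head = nn.ConvTranspose2d(dim, n_classes,
                                           kernel_size=patch, stride=patch)
        self.depth_head = nn.Sequential(
            nn.ConvTranspose2d(dim, 1, kernel_size=patch, stride=patch),
            nn.ReLU())                          # depth values are non-negative

    def forward(self, x):
        # (B, 3, H, W) -> (B, N, dim) token sequence with positional encoding
        tokens = self.embed(x).flatten(2).transpose(1, 2) + self.pos
        feats = self.backbone(tokens)           # shared features: (B, N, dim)
        # Fold tokens back into a spatial feature map for the conv heads.
        fmap = feats.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.seg_head(fmap), self.depth_head(fmap)

model = MultiTaskViT()
seg_logits, depth = model(torch.randn(1, 3, 224, 224))
print(seg_logits.shape, depth.shape)  # (1, 19, 224, 224) and (1, 1, 224, 224)
```

Under this sketch, the backbone-size trade-off the abstract mentions corresponds to scaling `dim` and `depth`: a larger backbone costs more per shared forward pass but can improve accuracy on both heads at once.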