Think Twice Before Recognizing: Large Multimodal Models for General Fine-grained Traffic Sign Recognition
Authors: Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
Published in: arXiv - CS - Multimedia, 2024-09-03
DOI: arxiv-2409.01534 (https://doi.org/arxiv-2409.01534)
Abstract
We propose a new strategy, called "think twice before recognizing", to improve
fine-grained traffic sign recognition (TSR). Fine-grained TSR in the wild is
difficult due to complex road conditions, and existing approaches struggle
in particular with cross-country TSR when data is scarce. Our strategy
achieves effective fine-grained TSR by stimulating the multiple-thinking
capability of large multimodal models (LMMs). We introduce context,
characteristic, and differential descriptions to design multiple thinking
processes for the LMM. The context descriptions, combined with center-coordinate
prompt optimization, help the LMM locate the target traffic sign in original
road images that contain multiple traffic signs and filter out irrelevant
answers through the proposed prior traffic sign hypothesis. The characteristic
description is based on few-shot in-context learning of template traffic signs,
which decreases the cross-domain difference and enhances the fine-grained
recognition capability of the LMM. The differential descriptions of similar
traffic signs optimize the multimodal thinking capability of the LMM. The
proposed method is independent of training data and requires only simple and
uniform instructions. We conducted extensive experiments on three benchmark
datasets and two real-world datasets from different countries, and the proposed
method achieves state-of-the-art TSR results on all five datasets.
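
As a rough illustration of the three description types named in the abstract, the sketch below assembles one prompt per thinking step. The abstract does not give the actual prompt wording, data structures, or coordinate convention, so everything here is an assumption made for illustration: the names `TemplateSign`, `center_coordinate`, and `build_prompts`, the normalized coordinates, and the prompt texts are all hypothetical. No LMM is called; the code only builds the text that would be sent alongside the road image.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TemplateSign:
    """A template traffic sign used as a few-shot example (hypothetical structure)."""
    label: str
    characteristic: str  # short text description of shape, color, and symbol


def center_coordinate(box: Tuple[float, float, float, float],
                      image_size: Tuple[int, int]) -> Tuple[float, float]:
    """Normalized (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    return ((x_min + x_max) / 2 / width, (y_min + y_max) / 2 / height)


def build_prompts(box: Tuple[float, float, float, float],
                  image_size: Tuple[int, int],
                  templates: List[TemplateSign],
                  similar_pairs: List[Tuple[str, str, str]]) -> List[str]:
    """Return [context, characteristic, differential] prompts (illustrative wording)."""
    cx, cy = center_coordinate(box, image_size)

    # Context step: point the model at one sign among several in the image.
    # Stating up front that the target is a traffic sign is an assumed reading
    # of the "prior traffic sign hypothesis", not the authors' formulation.
    context = (
        f"There is a traffic sign centered at ({cx:.2f}, {cy:.2f}) "
        "in normalized image coordinates. Describe this sign and its surroundings."
    )

    # Characteristic step: few-shot examples built from template signs.
    examples = "\n".join(f"- {t.label}: {t.characteristic}" for t in templates)
    characteristic = (
        "Here are template traffic signs and their characteristics:\n"
        f"{examples}\n"
        "Describe the characteristics of the target sign in the same style."
    )

    # Differential step: spell out how easily confused signs differ.
    differences = "\n".join(f"- {a} vs {b}: {diff}" for a, b, diff in similar_pairs)
    differential = (
        "Some signs look alike. Key differences:\n"
        f"{differences}\n"
        "Using the descriptions so far, name the target sign."
    )

    return [context, characteristic, differential]


if __name__ == "__main__":
    prompts = build_prompts(
        box=(100, 50, 300, 150),
        image_size=(800, 400),
        templates=[TemplateSign("stop", "red octagon with white text")],
        similar_pairs=[("speed limit 30", "speed limit 80", "the digits differ")],
    )
    for step, prompt in zip(("context", "characteristic", "differential"), prompts):
        print(f"[{step}]\n{prompt}\n")
```

Keeping the three prompts separate, rather than merging them into one instruction, mirrors the abstract's framing of multiple thinking processes; how the steps are actually chained in the paper is not stated in the abstract.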