VLFSE: Enhancing visual tracking through visual language fusion and state update evaluator
Authors: Fuchao Yang, Mingkai Jiang, Qiaohong Hao, Xiaolei Zhao, Qinghe Feng
Journal: Machine learning with applications, vol. 18, Article 100588
Published: 2024-09-30
DOI: 10.1016/j.mlwa.2024.100588
URL: https://www.sciencedirect.com/science/article/pii/S2666827024000641
Citations: 0
Abstract
Visual tracking algorithms have recently achieved impressive results by incorporating dynamic templates. However, unstable visual input and poorly timed template updates reduce tracking accuracy and stability in complex scenarios. To address these issues, we propose VLFSE, a visual tracking algorithm built on visual language fusion and a state update evaluator. Specifically, our approach introduces a multimodal attention mechanism that uses self-attention to mine and integrate information from the visual and language inputs. This mechanism yields a richer, context-aware representation of the target, enabling more accurate tracking even in complex scenes. Because precise template updates are critical to maintaining tracking accuracy over time, we also develop a state update evaluator, a component trained online to assess whether and when the template should be updated. This evaluator acts as a safeguard, preventing erroneous updates and helping the tracker adapt to changes in the target's appearance. Experimental results on challenging visual language tracking datasets demonstrate our tracker's superior performance, adaptability, and accuracy in complex tracking scenarios.
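The abstract gives no architectural details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. Visual and language tokens are concatenated and passed through a single-head self-attention with identity projections (no learned weights), and a template is replaced only when an evaluator score clears a threshold. The names `fuse` and `maybe_update_template`, the threshold value, and the idea that the evaluator emits a scalar score are all hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption: a real tracker would use learned projections and
    multiple heads; they are omitted to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all input tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def fuse(visual_tokens, language_tokens):
    """Hypothetical fusion: joint self-attention over both modalities.

    Returns only the visual positions, which now carry language context.
    """
    fused = self_attention(visual_tokens + language_tokens)
    return fused[: len(visual_tokens)]


def maybe_update_template(template, candidate, score, threshold=0.5):
    """Hypothetical gate: accept the candidate template only when the
    evaluator score (assumed to lie in [0, 1]) reaches the threshold."""
    return candidate if score >= threshold else template


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0]]   # two toy visual tokens
    language = [[1.0, 1.0]]             # one toy language token
    print(fuse(visual, language))
    print(maybe_update_template("old", "new", score=0.9))
```

In this sketch the gate is a fixed threshold; the abstract instead describes an evaluator trained online, which would supply the score and possibly the decision rule itself.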