Title: LV2DMOT: Language and Visual Multimodal Feature Learning for Multiobject Tracking
Authors: Ru Hong; Zeyu Cai; Jiming Yang; Feipeng Da
DOI: 10.1109/JSEN.2024.3519903
Journal: IEEE Sensors Journal, vol. 25, no. 4, pp. 7482-7495
Publication date: 2025-01-07 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10832530/
JCR: Q1 (Engineering, Electrical & Electronic); Impact Factor: 4.3
Citations: 0
Abstract
Multiobject tracking (MOT) aims to associate objects of the same identity across video frames, and robust similarity measurement is crucial for maintaining tracking performance. However, inefficient integration of motion and appearance cues in current methods often leads to tracking failures in challenging scenarios such as occlusions and missed detections. In this article, we introduce LV2DMOT, a tracker that employs a novel paradigm for integrating motion and appearance cues through language and visual multimodal feature learning, thereby generating more distinctive data association similarities. We propose three key techniques: 1) a text-matching task between tracking trajectories and candidate detections, which uses text encoding of detection geometric information combined with the temporal model Mamba to extract temporal motion features of trajectories, enabling more accurate motion similarity calculations; 2) a multimodal, multilevel feature fusion model that integrates motion and appearance features via a cross-modal learning mechanism, resulting in more robust fused similarities; and 3) a learnable temporal attention model for trajectory appearance feature updates, which effectively aggregates historical visual features to improve the representational ability of trajectory appearance features, employing k-medoids for feature selection. Extensive experiments on the MOT17 and MOT20 datasets demonstrate that our method achieves state-of-the-art (SOTA) tracking performance.
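The third technique mentions k-medoids for selecting representative historical appearance features per trajectory. The paper's exact formulation is not given here, so the following is only a minimal sketch of the general idea: given a trajectory's history of L2-normalized appearance embeddings, pick k medoids under cosine distance so the attention model aggregates a compact, diverse subset rather than every past frame. The function name and all parameters are illustrative, not the authors' API.

```python
import numpy as np

def k_medoids_select(features, k, n_iter=20, seed=0):
    """Pick k representative appearance embeddings (medoids) from a
    trajectory's feature history.

    features: (N, D) array of L2-normalized appearance embeddings.
    Returns sorted indices of the k chosen medoids.
    """
    n = len(features)
    if k >= n:
        return np.arange(n)
    # Pairwise cosine distance; embeddings are assumed unit-norm.
    dist = 1.0 - features @ features.T
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every historical feature to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # New medoid: the member minimizing total distance
            # to all other members of its cluster.
            costs = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.sort(medoids)
```

A selection like this keeps the trajectory's appearance bank small and non-redundant, which is the usual motivation for medoid-based (rather than mean-based) summaries: each kept feature is an actual observed embedding, so it stays valid as an attention key.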
Journal Introduction:
The fields of interest of the IEEE Sensors Journal are the theory, design, fabrication, manufacturing, and applications of devices for sensing and transducing physical, chemical, and biological phenomena, with emphasis on the electronics and physics aspects of sensors and integrated sensor-actuators. IEEE Sensors Journal deals with the following:
-Sensor Phenomenology, Modelling, and Evaluation
-Sensor Materials, Processing, and Fabrication
-Chemical and Gas Sensors
-Microfluidics and Biosensors
-Optical Sensors
-Physical Sensors: Temperature, Mechanical, Magnetic, and others
-Acoustic and Ultrasonic Sensors
-Sensor Packaging
-Sensor Networks
-Sensor Applications
-Sensor Systems: Signals, Processing, and Interfaces
-Actuators and Sensor Power Systems
-Sensor Signal Processing for high precision and stability (amplification, filtering, linearization, modulation/demodulation) and under harsh conditions (EMC, radiation, humidity, temperature); energy consumption/harvesting
-Sensor Data Processing (soft computing with sensor data, e.g., pattern recognition, machine learning, evolutionary computation; sensor data fusion; processing of wave sensor data, e.g., electromagnetic and acoustic, and non-wave sensor data, e.g., chemical, gravity, particle, thermal, radiative and non-radiative; detection, estimation, and classification based on sensor data)
-Sensors in Industrial Practice