Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, because roadside and vehicle-side scenarios differ substantially, no multi-camera BEV solution currently exists for the roadside. This paper systematically analyzes the key challenges of multi-camera BEV perception in roadside scenarios compared to vehicle-side ones: the diversity of camera poses, the uncertainty in camera numbers, the sparsity of perception regions, and the ambiguity of orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach for roadside perception. RopeBEV uses BEV augmentation to address the training imbalance caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset covering more than 50 intersections and 600 cameras.
{"title":"RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View","authors":"Jinrang Jia, Guangqi Yi, Yifeng Shi","doi":"arxiv-2409.11706","DOIUrl":"https://doi.org/arxiv-2409.11706","url":null,"abstract":"Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide\u0000application in autonomous driving. However, due to the differences between\u0000roadside and vehicle-side scenarios, there currently lacks a multi-camera BEV\u0000solution in roadside. This paper systematically analyzes the key challenges in\u0000multi-camera BEV perception for roadside scenarios compared to vehicle-side.\u0000These challenges include the diversity in camera poses, the uncertainty in\u0000Camera numbers, the sparsity in perception regions, and the ambiguity in\u0000orientation angles. In response, we introduce RopeBEV, the first dense\u0000multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the\u0000training balance issues caused by diverse camera poses. By incorporating\u0000CamMask and ROIMask (Region of Interest Mask), it supports variable camera\u0000numbers and sparse perception, respectively. Finally, camera rotation embedding\u0000is utilized to resolve orientation ambiguity. Our method ranks 1st on the\u0000real-world highway dataset RoScenes and demonstrates its practical value on a\u0000private urban dataset that covers more than 50 intersections and 600 cameras.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research has predominantly employed 2D sensors to gather gait data and has achieved notable advances; nonetheless, it has unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition using LiDAR 3D point clouds not only directly captures 3D spatial features but also reduces the impact of lighting conditions while ensuring privacy protection. The essence of the problem lies in how to effectively extract discriminative 3D dynamic representations from point clouds. In this paper, we propose SpheriGait, a method for extracting and enhancing dynamic features from point clouds for LiDAR-based gait recognition. Specifically, it replaces the conventional plane projection of point clouds with a spherical projection to strengthen the perception of dynamic features. Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. Extensive experiments demonstrate that SpheriGait achieves state-of-the-art performance on the SUSTech1K dataset and that the spherical projection can serve as a universal data preprocessing technique to boost the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.
{"title":"SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition","authors":"Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao","doi":"arxiv-2409.11869","DOIUrl":"https://doi.org/arxiv-2409.11869","url":null,"abstract":"Gait recognition is a rapidly progressing technique for the remote\u0000identification of individuals. Prior research predominantly employing 2D\u0000sensors to gather gait data has achieved notable advancements; nonetheless,\u0000they have unavoidably neglected the influence of 3D dynamic characteristics on\u0000recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly\u0000captures 3D spatial features but also diminishes the impact of lighting\u0000conditions while ensuring privacy protection.The essence of the problem lies in\u0000how to effectively extract discriminative 3D dynamic representation from point\u0000clouds.In this paper, we proposes a method named SpheriGait for extracting and\u0000enhancing dynamic features from point clouds for Lidar-based gait recognition.\u0000Specifically, it substitutes the conventional point cloud plane projection\u0000method with spherical projection to augment the perception of dynamic\u0000feature.Additionally, a network block named DAM-L is proposed to extract gait\u0000cues from the projected point cloud data. We conducted extensive experiments\u0000and the results demonstrated the SpheriGait achieved state-of-the-art\u0000performance on the SUSTech1K dataset, and verified that the spherical\u0000projection method can serve as a universal data preprocessing technique to\u0000enhance the performance of other LiDAR-based gait recognition methods,\u0000exhibiting exceptional flexibility and practicality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak
Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns while overlooking the fine minutiae details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances, leading to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations on two publicly available datasets demonstrate that our method outperforms existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.
{"title":"Latent fingerprint enhancement for accurate minutiae detection","authors":"Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak","doi":"arxiv-2409.11802","DOIUrl":"https://doi.org/arxiv-2409.11802","url":null,"abstract":"Identification of suspects based on partial and smudged fingerprints,\u0000commonly referred to as fingermarks or latent fingerprints, presents a\u0000significant challenge in the field of fingerprint recognition. Although\u0000fixed-length embeddings have shown effectiveness in recognising rolled and slap\u0000fingerprints, the methods for matching latent fingerprints have primarily\u0000centred around local minutiae-based embeddings, failing to fully exploit global\u0000representations for matching purposes. Consequently, enhancing latent\u0000fingerprints becomes critical to ensuring robust identification for forensic\u0000investigations. Current approaches often prioritise restoring ridge patterns,\u0000overlooking the fine-macroeconomic details crucial for accurate fingerprint\u0000recognition. To address this, we propose a novel approach that uses generative\u0000adversary networks (GANs) to redefine Latent Fingerprint Enhancement (LFE)\u0000through a structured approach to fingerprint generation. By directly optimising\u0000the minutiae information during the generation process, the model produces\u0000enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth\u0000instances. This leads to a significant improvement in identification\u0000performance. Our framework integrates minutiae locations and orientation\u0000fields, ensuring the preservation of both local and structural fingerprint\u0000features. Extensive evaluations conducted on two publicly available datasets\u0000demonstrate our method's dominance over existing state-of-the-art techniques,\u0000highlighting its potential to significantly enhance latent fingerprint\u0000recognition accuracy in forensic applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer and now widely used for various gastrointestinal lesions. Computer-assisted surgery systems have played a crucial role in improving the precision and safety of ESD procedures; however, their effectiveness hinges on accurate recognition of surgical phases. The intricate nature of ESD, with varied lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal context in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models such as Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.
{"title":"SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba","authors":"Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian","doi":"arxiv-2409.12108","DOIUrl":"https://doi.org/arxiv-2409.12108","url":null,"abstract":"Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure\u0000initially designed for the treatment of early gastric cancer but is now widely\u0000used for various gastrointestinal lesions. Computer-assisted Surgery systems\u0000have played a crucial role in improving the precision and safety of ESD\u0000procedures, however, their effectiveness is limited by the accurate recognition\u0000of surgical phases. The intricate nature of ESD, with different lesion\u0000characteristics and tissue structures, presents challenges for real-time\u0000surgical phase recognition algorithms. Existing surgical phase recognition\u0000algorithms struggle to efficiently capture temporal contexts in video-based\u0000scenarios, leading to insufficient performance. To address these issues, we\u0000propose SPRMamba, a novel Mamba-based framework for ESD surgical phase\u0000recognition. SPRMamba leverages the strengths of Mamba for long-term temporal\u0000modeling while introducing the Scaled Residual TranMamba block to enhance the\u0000capture of fine-grained details, overcoming the limitations of traditional\u0000temporal models like Temporal Convolutional Networks and Transformers.\u0000Moreover, a Temporal Sample Strategy is introduced to accelerate the\u0000processing, which is essential for real-time phase recognition in clinical\u0000settings. Extensive testing on the ESD385 dataset and the cholecystectomy\u0000Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art\u0000methods and exhibits greater robustness across various surgical phase\u0000recognition tasks.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the anisotropic reflectance of complex Earth surfaces from satellite imagery is crucial for numerous applications. Neural radiance fields (NeRF) have become popular as a machine learning technique capable of deducing the bidirectional reflectance distribution function (BRDF) of a scene from multiple images. However, prior research has largely concentrated on applying NeRF to close-range imagery, estimating basic Microfacet BRDF models, which fall short for many Earth surfaces. Moreover, high-quality NeRFs generally require several images captured simultaneously, a rare occurrence in satellite imaging. To address these limitations, we propose BRDF-NeRF, developed to explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical BRDF model commonly employed in remote sensing. We assess our approach using two datasets: (1) Djibouti, captured in a single epoch at varying viewing angles with a fixed Sun position, and (2) Lanzhou, captured over multiple epochs with different viewing angles and Sun positions. Our results, based on only three to four satellite images for training, demonstrate that BRDF-NeRF can effectively synthesize novel views from directions far removed from the training data and produce high-quality digital surface models (DSMs).
{"title":"BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling","authors":"Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger","doi":"arxiv-2409.12014","DOIUrl":"https://doi.org/arxiv-2409.12014","url":null,"abstract":"Understanding the anisotropic reflectance of complex Earth surfaces from\u0000satellite imagery is crucial for numerous applications. Neural radiance fields\u0000(NeRF) have become popular as a machine learning technique capable of deducing\u0000the bidirectional reflectance distribution function (BRDF) of a scene from\u0000multiple images. However, prior research has largely concentrated on applying\u0000NeRF to close-range imagery, estimating basic Microfacet BRDF models, which\u0000fall short for many Earth surfaces. Moreover, high-quality NeRFs generally\u0000require several images captured simultaneously, a rare occurrence in satellite\u0000imaging. To address these limitations, we propose BRDF-NeRF, developed to\u0000explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical\u0000BRDF model commonly employed in remote sensing. We assess our approach using\u0000two datasets: (1) Djibouti, captured in a single epoch at varying viewing\u0000angles with a fixed Sun position, and (2) Lanzhou, captured over multiple\u0000epochs with different viewing angles and Sun positions. Our results, based on\u0000only three to four satellite images for training, demonstrate that BRDF-NeRF\u0000can effectively synthesize novel views from directions far removed from the\u0000training data and produce high-quality digital surface models (DSMs).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"22 2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited due to a single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from visual foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on historical content and then traverses along the trajectory to predict future semantics as the coding context. This reduces the overall bitcost of the system, further improving compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
{"title":"Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression","authors":"Yuan Tian, Guo Lu, Guangtao Zhai","doi":"arxiv-2409.11718","DOIUrl":"https://doi.org/arxiv-2409.11718","url":null,"abstract":"Unsupervised video semantic compression (UVSC), i.e., compressing videos to\u0000better support various analysis tasks, has recently garnered attention.\u0000However, the semantic richness of previous methods remains limited, due to the\u0000single semantic learning objective, limited training data, etc. To address\u0000this, we propose to boost the UVSC task by absorbing the off-the-shelf rich\u0000semantics from VFMs. Specifically, we introduce a VFMs-shared semantic\u0000alignment layer, complemented by VFM-specific prompts, to flexibly align\u0000semantics between the compressed video and various VFMs. This allows different\u0000VFMs to collaboratively build a mutually-enhanced semantic space, guiding the\u0000learning of the compression model. Moreover, we introduce a dynamic\u0000trajectory-based inter-frame compression scheme, which first estimates the\u0000semantic trajectory based on the historical content, and then traverses along\u0000the trajectory to predict the future semantics as the coding context. This\u0000reduces the overall bitcost of the system, further improving the compression\u0000efficiency. Our approach outperforms previous coding methods on three\u0000mainstream tasks and six datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit the ability of modern LDMs to effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generation. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.
{"title":"Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance","authors":"Jaehoon Joo, Taejin Jeong, Seongjae Hwang","doi":"arxiv-2409.12099","DOIUrl":"https://doi.org/arxiv-2409.12099","url":null,"abstract":"Understanding how humans process visual information is one of the crucial\u0000steps for unraveling the underlying mechanism of brain activity. Recently, this\u0000curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI\u0000data from visual stimuli, it aims to reconstruct the corresponding visual\u0000stimuli. Surprisingly, leveraging powerful generative models such as the Latent\u0000Diffusion Model (LDM) has shown promising results in reconstructing complex\u0000visual stimuli such as high-resolution natural images from vision datasets.\u0000Despite the impressive structural fidelity of these reconstructions, they often\u0000lack details of small objects, ambiguous shapes, and semantic nuances.\u0000Consequently, the incorporation of additional semantic knowledge, beyond mere\u0000visuals, becomes imperative. In light of this, we exploit how modern LDMs\u0000effectively incorporate multi-modal guidance (text guidance, visual guidance,\u0000and image layout) for structurally and semantically plausible image\u0000generations. Specifically, inspired by the two-streams hypothesis suggesting\u0000that perceptual and semantic information are processed in different brain\u0000regions, our framework, Brain-Streams, maps fMRI signals from these brain\u0000regions to appropriate embeddings. That is, by extracting textual guidance from\u0000semantic information regions and visual guidance from perceptual information\u0000regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\u0000validate the reconstruction ability of Brain-Streams both quantitatively and\u0000qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\u0000data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D Multi-Object Tracking (MOT) has obtained significant performance improvements with the rapid advancement of 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique characteristics of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces of a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.
{"title":"RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework","authors":"Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui","doi":"arxiv-2409.11749","DOIUrl":"https://doi.org/arxiv-2409.11749","url":null,"abstract":"3D Multi-Object Tracking (MOT) obtains significant performance improvements\u0000with the rapid advancements in 3D object detection, particularly in\u0000cost-effective multi-camera setups. However, the prevalent end-to-end training\u0000approach for multi-camera trackers results in detector-specific models,\u0000limiting their versatility. Moreover, current generic trackers overlook the\u0000unique features of multi-camera detectors, i.e., the unreliability of motion\u0000observations and the feasibility of visual information. To address these\u0000challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors.\u0000Following the Tracking-By-Detection framework, RockTrack is compatible with\u0000various off-the-shelf detectors. RockTrack incorporates a confidence-guided\u0000preprocessing module to extract reliable motion and image observations from\u0000distinct representation spaces from a single detector. These observations are\u0000then fused in an association module that leverages geometric and appearance\u0000cues to minimize mismatches. The resulting matches are propagated through a\u0000staged estimation process, forming the basis for heuristic noise modeling.\u0000Additionally, we introduce a novel appearance similarity metric for explicitly\u0000characterizing object affinities in multi-camera settings. RockTrack achieves\u0000state-of-the-art performance on the nuScenes vision-only tracking leaderboard\u0000with 59.1% AMOTA while demonstrating impressive computational efficiency.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach
In practical use cases, editing an existing polygonal mesh can be faster than generating a new one, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely edited meshes more rapidly than current state-of-the-art methods. We include our code and additional results in the supplementary material.
{"title":"LEMON: Localized Editing with Mesh Optimization and Neural Shaders","authors":"Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach","doi":"arxiv-2409.12024","DOIUrl":"https://doi.org/arxiv-2409.12024","url":null,"abstract":"In practical use cases, polygonal mesh editing can be faster than generating\u0000new ones, but it can still be challenging and time-consuming for users.\u0000Existing solutions for this problem tend to focus on a single task, either\u0000geometry or novel view synthesis, which often leads to disjointed results\u0000between the mesh and view. In this work, we propose LEMON, a mesh editing\u0000pipeline that combines neural deferred shading with localized mesh\u0000optimization. Our approach begins by identifying the most important vertices in\u0000the mesh for editing, utilizing a segmentation model to focus on these key\u0000regions. Given multi-view images of an object, we optimize a neural shader and\u0000a polygonal mesh while extracting the normal map and the rendered image from\u0000each view. By using these outputs as conditioning data, we edit the input\u0000images with a text-to-image diffusion model and iteratively update our dataset\u0000while deforming the mesh. This process results in a polygonal mesh that is\u0000edited according to the given text instruction, preserving the geometric\u0000characteristics of the initial mesh while focusing on the most significant\u0000areas. We evaluate our pipeline using the DTU dataset, demonstrating that it\u0000generates finely-edited meshes more rapidly than the current state-of-the-art\u0000methods. We include our code and additional results in the supplementary\u0000material.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to apply VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism that ensures the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387
{"title":"Mixture of Prompt Learning for Vision Language Models","authors":"Yu Du, Tong Niu, Rong Zhao","doi":"arxiv-2409.12011","DOIUrl":"https://doi.org/arxiv-2409.12011","url":null,"abstract":"As powerful pre-trained vision-language models (VLMs) like CLIP gain\u0000prominence, numerous studies have attempted to combine VLMs for downstream\u0000tasks. Among these, prompt learning has been validated as an effective method\u0000for adapting to new tasks, which only requiring a small number of parameters.\u0000However, current prompt learning methods face two challenges: first, a single\u0000soft prompt struggles to capture the diverse styles and patterns within a\u0000dataset; second, fine-tuning soft prompts is prone to overfitting. To address\u0000these challenges, we propose a mixture of soft prompt learning method\u0000incorporating a routing module. This module is able to capture a dataset's\u0000varied styles and dynamically selects the most suitable prompts for each\u0000instance. Additionally, we introduce a novel gating mechanism to ensure the\u0000router selects prompts based on their similarity to hard prompt templates,\u0000which both retaining knowledge from hard prompts and improving selection\u0000accuracy. We also implement semantically grouped text-level supervision,\u0000initializing each soft prompt with the token embeddings of manually designed\u0000templates from its group and applied a contrastive loss between the resulted\u0000text feature and hard prompt encoded text feature. This supervision ensures\u0000that the text features derived from soft prompts remain close to those from\u0000their corresponding hard prompts, preserving initial knowledge and mitigating\u0000overfitting. Our method has been validated on 11 datasets, demonstrating\u0000evident improvements in few-shot learning, domain generalization, and\u0000base-to-new generalization scenarios compared to existing baselines. The code\u0000will be available at url{https://anonymous.4open.science/r/mocoop-6387}","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}