Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, because roadside and vehicle-side scenarios differ substantially, no multi-camera BEV solution currently exists for the roadside. This paper systematically analyzes the key challenges of multi-camera BEV perception in roadside scenarios compared to vehicle-side ones: the diversity of camera poses, the uncertainty in camera numbers, the sparsity of perception regions, and the ambiguity of orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach for roadside perception. RopeBEV uses BEV augmentation to address the training imbalance caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset covering more than 50 intersections and 600 cameras.
{"title":"RopeBEV: A Multi-Camera Roadside Perception Network in Bird's-Eye-View","authors":"Jinrang Jia, Guangqi Yi, Yifeng Shi","doi":"arxiv-2409.11706","DOIUrl":"https://doi.org/arxiv-2409.11706","url":null,"abstract":"Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide\u0000application in autonomous driving. However, due to the differences between\u0000roadside and vehicle-side scenarios, there currently lacks a multi-camera BEV\u0000solution in roadside. This paper systematically analyzes the key challenges in\u0000multi-camera BEV perception for roadside scenarios compared to vehicle-side.\u0000These challenges include the diversity in camera poses, the uncertainty in\u0000Camera numbers, the sparsity in perception regions, and the ambiguity in\u0000orientation angles. In response, we introduce RopeBEV, the first dense\u0000multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the\u0000training balance issues caused by diverse camera poses. By incorporating\u0000CamMask and ROIMask (Region of Interest Mask), it supports variable camera\u0000numbers and sparse perception, respectively. Finally, camera rotation embedding\u0000is utilized to resolve orientation ambiguity. Our method ranks 1st on the\u0000real-world highway dataset RoScenes and demonstrates its practical value on a\u0000private urban dataset that covers more than 50 intersections and 600 cameras.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gait recognition is a rapidly progressing technique for the remote identification of individuals. Prior research has predominantly employed 2D sensors to gather gait data and has achieved notable advances; nonetheless, it has unavoidably neglected the influence of 3D dynamic characteristics on recognition. Gait recognition using LiDAR 3D point clouds not only directly captures 3D spatial features but also reduces the impact of lighting conditions while ensuring privacy protection. The essence of the problem lies in how to effectively extract discriminative 3D dynamic representations from point clouds. In this paper, we propose SpheriGait, a method for extracting and enhancing dynamic features from point clouds for LiDAR-based gait recognition. Specifically, it replaces the conventional plane projection of point clouds with a spherical projection to strengthen the perception of dynamic features. Additionally, a network block named DAM-L is proposed to extract gait cues from the projected point cloud data. Extensive experiments demonstrate that SpheriGait achieves state-of-the-art performance on the SUSTech1K dataset and that the spherical projection can serve as a universal data preprocessing technique to boost the performance of other LiDAR-based gait recognition methods, exhibiting exceptional flexibility and practicality.
{"title":"SpheriGait: Enriching Spatial Representation via Spherical Projection for LiDAR-based Gait Recognition","authors":"Yanxi Wang, Zhigang Chang, Chen Wu, Zihao Cheng, Hongmin Gao","doi":"arxiv-2409.11869","DOIUrl":"https://doi.org/arxiv-2409.11869","url":null,"abstract":"Gait recognition is a rapidly progressing technique for the remote\u0000identification of individuals. Prior research predominantly employing 2D\u0000sensors to gather gait data has achieved notable advancements; nonetheless,\u0000they have unavoidably neglected the influence of 3D dynamic characteristics on\u0000recognition. Gait recognition utilizing LiDAR 3D point clouds not only directly\u0000captures 3D spatial features but also diminishes the impact of lighting\u0000conditions while ensuring privacy protection.The essence of the problem lies in\u0000how to effectively extract discriminative 3D dynamic representation from point\u0000clouds.In this paper, we proposes a method named SpheriGait for extracting and\u0000enhancing dynamic features from point clouds for Lidar-based gait recognition.\u0000Specifically, it substitutes the conventional point cloud plane projection\u0000method with spherical projection to augment the perception of dynamic\u0000feature.Additionally, a network block named DAM-L is proposed to extract gait\u0000cues from the projected point cloud data. We conducted extensive experiments\u0000and the results demonstrated the SpheriGait achieved state-of-the-art\u0000performance on the SUSTech1K dataset, and verified that the spherical\u0000projection method can serve as a universal data preprocessing technique to\u0000enhance the performance of other LiDAR-based gait recognition methods,\u0000exhibiting exceptional flexibility and practicality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak
Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns while overlooking the fine minutiae details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances, leading to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations on two publicly available datasets demonstrate that our method outperforms existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.
{"title":"Latent fingerprint enhancement for accurate minutiae detection","authors":"Abdul Wahab, Tariq Mahmood Khan, Shahzaib Iqbal, Bandar AlShammari, Bandar Alhaqbani, Imran Razzak","doi":"arxiv-2409.11802","DOIUrl":"https://doi.org/arxiv-2409.11802","url":null,"abstract":"Identification of suspects based on partial and smudged fingerprints,\u0000commonly referred to as fingermarks or latent fingerprints, presents a\u0000significant challenge in the field of fingerprint recognition. Although\u0000fixed-length embeddings have shown effectiveness in recognising rolled and slap\u0000fingerprints, the methods for matching latent fingerprints have primarily\u0000centred around local minutiae-based embeddings, failing to fully exploit global\u0000representations for matching purposes. Consequently, enhancing latent\u0000fingerprints becomes critical to ensuring robust identification for forensic\u0000investigations. Current approaches often prioritise restoring ridge patterns,\u0000overlooking the fine-macroeconomic details crucial for accurate fingerprint\u0000recognition. To address this, we propose a novel approach that uses generative\u0000adversary networks (GANs) to redefine Latent Fingerprint Enhancement (LFE)\u0000through a structured approach to fingerprint generation. By directly optimising\u0000the minutiae information during the generation process, the model produces\u0000enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth\u0000instances. This leads to a significant improvement in identification\u0000performance. Our framework integrates minutiae locations and orientation\u0000fields, ensuring the preservation of both local and structural fingerprint\u0000features. Extensive evaluations conducted on two publicly available datasets\u0000demonstrate our method's dominance over existing state-of-the-art techniques,\u0000highlighting its potential to significantly enhance latent fingerprint\u0000recognition accuracy in forensic applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure initially designed for the treatment of early gastric cancer and now widely used for various gastrointestinal lesions. Computer-assisted surgery systems have played a crucial role in improving the precision and safety of ESD procedures; however, their effectiveness hinges on accurate recognition of surgical phases. The intricate nature of ESD, with varied lesion characteristics and tissue structures, presents challenges for real-time surgical phase recognition algorithms. Existing surgical phase recognition algorithms struggle to efficiently capture temporal context in video-based scenarios, leading to insufficient performance. To address these issues, we propose SPRMamba, a novel Mamba-based framework for ESD surgical phase recognition. SPRMamba leverages the strengths of Mamba for long-term temporal modeling while introducing the Scaled Residual TranMamba block to enhance the capture of fine-grained details, overcoming the limitations of traditional temporal models such as Temporal Convolutional Networks and Transformers. Moreover, a Temporal Sample Strategy is introduced to accelerate processing, which is essential for real-time phase recognition in clinical settings. Extensive testing on the ESD385 dataset and the cholecystectomy Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art methods and exhibits greater robustness across various surgical phase recognition tasks.
{"title":"SPRMamba: Surgical Phase Recognition for Endoscopic Submucosal Dissection with Mamba","authors":"Xiangning Zhang, Jinnan Chen, Qingwei Zhang, Chengfeng Zhou, Zhengjie Zhang, Xiaobo Li, Dahong Qian","doi":"arxiv-2409.12108","DOIUrl":"https://doi.org/arxiv-2409.12108","url":null,"abstract":"Endoscopic Submucosal Dissection (ESD) is a minimally invasive procedure\u0000initially designed for the treatment of early gastric cancer but is now widely\u0000used for various gastrointestinal lesions. Computer-assisted Surgery systems\u0000have played a crucial role in improving the precision and safety of ESD\u0000procedures, however, their effectiveness is limited by the accurate recognition\u0000of surgical phases. The intricate nature of ESD, with different lesion\u0000characteristics and tissue structures, presents challenges for real-time\u0000surgical phase recognition algorithms. Existing surgical phase recognition\u0000algorithms struggle to efficiently capture temporal contexts in video-based\u0000scenarios, leading to insufficient performance. To address these issues, we\u0000propose SPRMamba, a novel Mamba-based framework for ESD surgical phase\u0000recognition. SPRMamba leverages the strengths of Mamba for long-term temporal\u0000modeling while introducing the Scaled Residual TranMamba block to enhance the\u0000capture of fine-grained details, overcoming the limitations of traditional\u0000temporal models like Temporal Convolutional Networks and Transformers.\u0000Moreover, a Temporal Sample Strategy is introduced to accelerate the\u0000processing, which is essential for real-time phase recognition in clinical\u0000settings. Extensive testing on the ESD385 dataset and the cholecystectomy\u0000Cholec80 dataset demonstrates that SPRMamba surpasses existing state-of-the-art\u0000methods and exhibits greater robustness across various surgical phase\u0000recognition tasks.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding the anisotropic reflectance of complex Earth surfaces from satellite imagery is crucial for numerous applications. Neural radiance fields (NeRF) have become popular as a machine learning technique capable of deducing the bidirectional reflectance distribution function (BRDF) of a scene from multiple images. However, prior research has largely concentrated on applying NeRF to close-range imagery, estimating basic Microfacet BRDF models, which fall short for many Earth surfaces. Moreover, high-quality NeRFs generally require several images captured simultaneously, a rare occurrence in satellite imaging. To address these limitations, we propose BRDF-NeRF, developed to explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical BRDF model commonly employed in remote sensing. We assess our approach using two datasets: (1) Djibouti, captured in a single epoch at varying viewing angles with a fixed Sun position, and (2) Lanzhou, captured over multiple epochs with different viewing angles and Sun positions. Our results, based on only three to four satellite images for training, demonstrate that BRDF-NeRF can effectively synthesize novel views from directions far removed from the training data and produce high-quality digital surface models (DSMs).
{"title":"BRDF-NeRF: Neural Radiance Fields with Optical Satellite Images and BRDF Modelling","authors":"Lulin Zhang, Ewelina Rupnik, Tri Dung Nguyen, Stéphane Jacquemoud, Yann Klinger","doi":"arxiv-2409.12014","DOIUrl":"https://doi.org/arxiv-2409.12014","url":null,"abstract":"Understanding the anisotropic reflectance of complex Earth surfaces from\u0000satellite imagery is crucial for numerous applications. Neural radiance fields\u0000(NeRF) have become popular as a machine learning technique capable of deducing\u0000the bidirectional reflectance distribution function (BRDF) of a scene from\u0000multiple images. However, prior research has largely concentrated on applying\u0000NeRF to close-range imagery, estimating basic Microfacet BRDF models, which\u0000fall short for many Earth surfaces. Moreover, high-quality NeRFs generally\u0000require several images captured simultaneously, a rare occurrence in satellite\u0000imaging. To address these limitations, we propose BRDF-NeRF, developed to\u0000explicitly estimate the Rahman-Pinty-Verstraete (RPV) model, a semi-empirical\u0000BRDF model commonly employed in remote sensing. We assess our approach using\u0000two datasets: (1) Djibouti, captured in a single epoch at varying viewing\u0000angles with a fixed Sun position, and (2) Lanzhou, captured over multiple\u0000epochs with different viewing angles and Sun positions. Our results, based on\u0000only three to four satellite images for training, demonstrate that BRDF-NeRF\u0000can effectively synthesize novel views from directions far removed from the\u0000training data and produce high-quality digital surface models (DSMs).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"22 2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised video semantic compression (UVSC), i.e., compressing videos to better support various analysis tasks, has recently garnered attention. However, the semantic richness of previous methods remains limited due to a single semantic learning objective, limited training data, etc. To address this, we propose to boost the UVSC task by absorbing the off-the-shelf rich semantics from visual foundation models (VFMs). Specifically, we introduce a VFM-shared semantic alignment layer, complemented by VFM-specific prompts, to flexibly align semantics between the compressed video and various VFMs. This allows different VFMs to collaboratively build a mutually enhanced semantic space, guiding the learning of the compression model. Moreover, we introduce a dynamic trajectory-based inter-frame compression scheme, which first estimates the semantic trajectory based on historical content and then traverses along the trajectory to predict future semantics as the coding context. This reduces the overall bitcost of the system, further improving compression efficiency. Our approach outperforms previous coding methods on three mainstream tasks and six datasets.
{"title":"Free-VSC: Free Semantics from Visual Foundation Models for Unsupervised Video Semantic Compression","authors":"Yuan Tian, Guo Lu, Guangtao Zhai","doi":"arxiv-2409.11718","DOIUrl":"https://doi.org/arxiv-2409.11718","url":null,"abstract":"Unsupervised video semantic compression (UVSC), i.e., compressing videos to\u0000better support various analysis tasks, has recently garnered attention.\u0000However, the semantic richness of previous methods remains limited, due to the\u0000single semantic learning objective, limited training data, etc. To address\u0000this, we propose to boost the UVSC task by absorbing the off-the-shelf rich\u0000semantics from VFMs. Specifically, we introduce a VFMs-shared semantic\u0000alignment layer, complemented by VFM-specific prompts, to flexibly align\u0000semantics between the compressed video and various VFMs. This allows different\u0000VFMs to collaboratively build a mutually-enhanced semantic space, guiding the\u0000learning of the compression model. Moreover, we introduce a dynamic\u0000trajectory-based inter-frame compression scheme, which first estimates the\u0000semantic trajectory based on the historical content, and then traverses along\u0000the trajectory to predict the future semantics as the coding context. This\u0000reduces the overall bitcost of the system, further improving the compression\u0000efficiency. Our approach outperforms previous coding methods on three\u0000mainstream tasks and six datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit the ability of modern LDMs to effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generation. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.
{"title":"Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance","authors":"Jaehoon Joo, Taejin Jeong, Seongjae Hwang","doi":"arxiv-2409.12099","DOIUrl":"https://doi.org/arxiv-2409.12099","url":null,"abstract":"Understanding how humans process visual information is one of the crucial\u0000steps for unraveling the underlying mechanism of brain activity. Recently, this\u0000curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI\u0000data from visual stimuli, it aims to reconstruct the corresponding visual\u0000stimuli. Surprisingly, leveraging powerful generative models such as the Latent\u0000Diffusion Model (LDM) has shown promising results in reconstructing complex\u0000visual stimuli such as high-resolution natural images from vision datasets.\u0000Despite the impressive structural fidelity of these reconstructions, they often\u0000lack details of small objects, ambiguous shapes, and semantic nuances.\u0000Consequently, the incorporation of additional semantic knowledge, beyond mere\u0000visuals, becomes imperative. In light of this, we exploit how modern LDMs\u0000effectively incorporate multi-modal guidance (text guidance, visual guidance,\u0000and image layout) for structurally and semantically plausible image\u0000generations. Specifically, inspired by the two-streams hypothesis suggesting\u0000that perceptual and semantic information are processed in different brain\u0000regions, our framework, Brain-Streams, maps fMRI signals from these brain\u0000regions to appropriate embeddings. That is, by extracting textual guidance from\u0000semantic information regions and visual guidance from perceptual information\u0000regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\u0000validate the reconstruction ability of Brain-Streams both quantitatively and\u0000qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\u0000data.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D Multi-Object Tracking (MOT) has obtained significant performance improvements with the rapid advancement of 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique characteristics of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces of a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.
{"title":"RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework","authors":"Xiaoyu Li, Peidong Li, Lijun Zhao, Dedong Liu, Jinghan Gao, Xian Wu, Yitao Wu, Dixiao Cui","doi":"arxiv-2409.11749","DOIUrl":"https://doi.org/arxiv-2409.11749","url":null,"abstract":"3D Multi-Object Tracking (MOT) obtains significant performance improvements\u0000with the rapid advancements in 3D object detection, particularly in\u0000cost-effective multi-camera setups. However, the prevalent end-to-end training\u0000approach for multi-camera trackers results in detector-specific models,\u0000limiting their versatility. Moreover, current generic trackers overlook the\u0000unique features of multi-camera detectors, i.e., the unreliability of motion\u0000observations and the feasibility of visual information. To address these\u0000challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors.\u0000Following the Tracking-By-Detection framework, RockTrack is compatible with\u0000various off-the-shelf detectors. RockTrack incorporates a confidence-guided\u0000preprocessing module to extract reliable motion and image observations from\u0000distinct representation spaces from a single detector. These observations are\u0000then fused in an association module that leverages geometric and appearance\u0000cues to minimize mismatches. The resulting matches are propagated through a\u0000staged estimation process, forming the basis for heuristic noise modeling.\u0000Additionally, we introduce a novel appearance similarity metric for explicitly\u0000characterizing object affinities in multi-camera settings. RockTrack achieves\u0000state-of-the-art performance on the nuScenes vision-only tracking leaderboard\u0000with 59.1% AMOTA while demonstrating impressive computational efficiency.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach
In practical use cases, editing an existing polygonal mesh can be faster than generating a new one, but it can still be challenging and time-consuming for users. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. In this work, we propose LEMON, a mesh editing pipeline that combines neural deferred shading with localized mesh optimization. Our approach begins by identifying the most important vertices in the mesh for editing, utilizing a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. By using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline using the DTU dataset, demonstrating that it generates finely edited meshes more rapidly than current state-of-the-art methods. We include our code and additional results in the supplementary material.
{"title":"LEMON: Localized Editing with Mesh Optimization and Neural Shaders","authors":"Furkan Mert Algan, Umut Yazgan, Driton Salihu, Cem Eteke, Eckehard Steinbach","doi":"arxiv-2409.12024","DOIUrl":"https://doi.org/arxiv-2409.12024","url":null,"abstract":"In practical use cases, polygonal mesh editing can be faster than generating\u0000new ones, but it can still be challenging and time-consuming for users.\u0000Existing solutions for this problem tend to focus on a single task, either\u0000geometry or novel view synthesis, which often leads to disjointed results\u0000between the mesh and view. In this work, we propose LEMON, a mesh editing\u0000pipeline that combines neural deferred shading with localized mesh\u0000optimization. Our approach begins by identifying the most important vertices in\u0000the mesh for editing, utilizing a segmentation model to focus on these key\u0000regions. Given multi-view images of an object, we optimize a neural shader and\u0000a polygonal mesh while extracting the normal map and the rendered image from\u0000each view. By using these outputs as conditioning data, we edit the input\u0000images with a text-to-image diffusion model and iteratively update our dataset\u0000while deforming the mesh. This process results in a polygonal mesh that is\u0000edited according to the given text instruction, preserving the geometric\u0000characteristics of the initial mesh while focusing on the most significant\u0000areas. We evaluate our pipeline using the DTU dataset, demonstrating that it\u0000generates finely-edited meshes more rapidly than the current state-of-the-art\u0000methods. We include our code and additional results in the supplementary\u0000material.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to apply VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture of soft prompt learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism that ensures the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387
{"title":"Mixture of Prompt Learning for Vision Language Models","authors":"Yu Du, Tong Niu, Rong Zhao","doi":"arxiv-2409.12011","DOIUrl":"https://doi.org/arxiv-2409.12011","url":null,"abstract":"As powerful pre-trained vision-language models (VLMs) like CLIP gain\u0000prominence, numerous studies have attempted to combine VLMs for downstream\u0000tasks. Among these, prompt learning has been validated as an effective method\u0000for adapting to new tasks, which only requiring a small number of parameters.\u0000However, current prompt learning methods face two challenges: first, a single\u0000soft prompt struggles to capture the diverse styles and patterns within a\u0000dataset; second, fine-tuning soft prompts is prone to overfitting. To address\u0000these challenges, we propose a mixture of soft prompt learning method\u0000incorporating a routing module. This module is able to capture a dataset's\u0000varied styles and dynamically selects the most suitable prompts for each\u0000instance. Additionally, we introduce a novel gating mechanism to ensure the\u0000router selects prompts based on their similarity to hard prompt templates,\u0000which both retaining knowledge from hard prompts and improving selection\u0000accuracy. We also implement semantically grouped text-level supervision,\u0000initializing each soft prompt with the token embeddings of manually designed\u0000templates from its group and applied a contrastive loss between the resulted\u0000text feature and hard prompt encoded text feature. This supervision ensures\u0000that the text features derived from soft prompts remain close to those from\u0000their corresponding hard prompts, preserving initial knowledge and mitigating\u0000overfitting. Our method has been validated on 11 datasets, demonstrating\u0000evident improvements in few-shot learning, domain generalization, and\u0000base-to-new generalization scenarios compared to existing baselines. The code\u0000will be available at url{https://anonymous.4open.science/r/mocoop-6387}","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}