What Makes a Maze Look Like a Maze?
Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu
A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
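As a rough illustration of the hierarchical grounding idea, the sketch below represents a schema as a dependency graph and grounds components in topological order, so concrete symbols are resolved before the abstract ones that depend on them. The `Schema` fields and the `vlm_ground` call are hypothetical stand-ins, not DSG's actual interface.

```python
# A minimal sketch of schema-based hierarchical grounding, assuming a
# hypothetical vlm_ground() query; DSG's actual prompts/models are not shown.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

@dataclass
class Schema:
    concept: str                                    # e.g., "maze"
    components: dict = field(default_factory=dict)  # symbol -> text description
    depends_on: dict = field(default_factory=dict)  # symbol -> prerequisite symbols
                                                    # (every symbol must appear as a key)

def vlm_ground(image, description, context):
    """Hypothetical VLM call: localize `description` in `image`, conditioned
    on the groundings of prerequisite components in `context`."""
    raise NotImplementedError

def ground_schema(image, schema: Schema):
    groundings = {}
    # Topological order guarantees concrete components are grounded before
    # the abstract components that reference them.
    for symbol in TopologicalSorter(schema.depends_on).static_order():
        context = {d: groundings[d] for d in schema.depends_on[symbol]}
        groundings[symbol] = vlm_ground(image, schema.components[symbol], context)
    return groundings

maze = Schema(
    concept="maze",
    components={"walls": "barriers arranged in a grid", "paths": "gaps between walls"},
    depends_on={"walls": [], "paths": ["walls"]},
)
```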
{"title":"What Makes a Maze Look Like a Maze?","authors":"Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu","doi":"arxiv-2409.08202","DOIUrl":"https://doi.org/arxiv-2409.08202","url":null,"abstract":"A unique aspect of human visual understanding is the ability to flexibly\u0000interpret abstract concepts: acquiring lifted rules explaining what they\u0000symbolize, grounding them across familiar and unfamiliar contexts, and making\u0000predictions or reasoning about them. While off-the-shelf vision-language models\u0000excel at making literal interpretations of images (e.g., recognizing object\u0000categories such as tree branches), they still struggle to make sense of such\u0000visual abstractions (e.g., how an arrangement of tree branches may form the\u0000walls of a maze). To address this challenge, we introduce Deep Schema Grounding\u0000(DSG), a framework that leverages explicit structured representations of visual\u0000abstractions for grounding and reasoning. At the core of DSG are\u0000schemas--dependency graph descriptions of abstract concepts that decompose them\u0000into more primitive-level symbols. DSG uses large language models to extract\u0000schemas, then hierarchically grounds concrete to abstract components of the\u0000schema onto images with vision-language models. The grounded schema is used to\u0000augment visual abstraction understanding. We systematically evaluate DSG and\u0000different methods in reasoning on our new Visual Abstractions Dataset, which\u0000consists of diverse, real-world images of abstract concepts and corresponding\u0000question-answer pairs labeled by humans. We show that DSG significantly\u0000improves the abstract visual reasoning performance of vision-language models,\u0000and is a step toward human-aligned understanding of visual abstractions.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding
Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment through panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps, along with improved general segmentation performance. However, directly applying frozen Diffusion models to the PNG task with phrase features as static prompts still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. We therefore propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet that dynamically updates phrase prompts with image features and injects the multimodal cues back, exploiting the fine-grained image-text alignment capability of Diffusion models more fully. In addition, we design a Multi-Level Mutual Aggregation (MLMA) module that reciprocally fuses multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
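The extractive-injective pattern can be pictured as two cross-attention passes: phrase prompts first query image features to update themselves, then the updated prompts are injected back into the image features. The sketch below is a generic reading of that pattern with assumed shapes and module placement, not the paper's exact EIPA design inside the Diffusion UNet.

```python
# A generic two-way cross-attention sketch of the extractive-injective idea;
# dimensions and placement inside the UNet are assumptions.
import torch
import torch.nn as nn

class PhraseAdapter(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.extract = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inject = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phrases, image_tokens):
        # Extractive pass: phrase prompts attend to image features and update.
        upd, _ = self.extract(phrases, image_tokens, image_tokens)
        phrases = phrases + upd
        # Injective pass: image tokens absorb the multimodal phrase cues.
        inj, _ = self.inject(image_tokens, phrases, phrases)
        return phrases, image_tokens + inj

adapter = PhraseAdapter()
phrases, tokens = adapter(torch.randn(2, 5, 256), torch.randn(2, 4096, 256))
```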
{"title":"Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding","authors":"Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu","doi":"arxiv-2409.08251","DOIUrl":"https://doi.org/arxiv-2409.08251","url":null,"abstract":"Panoptic narrative grounding (PNG), whose core target is fine-grained\u0000image-text alignment, requires a panoptic segmentation of referred objects\u0000given a narrative caption. Previous discriminative methods achieve only weak or\u0000coarse-grained alignment by panoptic segmentation pretraining or CLIP model\u0000adaptation. Given the recent progress of text-to-image Diffusion models,\u0000several works have shown their capability to achieve fine-grained image-text\u0000alignment through cross-attention maps and improved general segmentation\u0000performance. However, the direct use of phrase features as static prompts to\u0000apply frozen Diffusion models to the PNG task still suffers from a large task\u0000gap and insufficient vision-language interaction, yielding inferior\u0000performance. Therefore, we propose an Extractive-Injective Phrase Adapter\u0000(EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts\u0000with image features and inject the multimodal cues back, which leverages the\u0000fine-grained image-text alignment capability of Diffusion models more\u0000sufficiently. In addition, we also design a Multi-Level Mutual Aggregation\u0000(MLMA) module to reciprocally fuse multi-level image and phrase features for\u0000segmentation refinement. Extensive experiments on the PNG benchmark show that\u0000our method achieves new state-of-the-art performance.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPARK: Self-supervised Personalized Real-time Monocular Face Capture
Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real time across a wide range of identities, lighting conditions, and poses by leveraging large image datasets of human faces. These methods, however, suffer from a clear limitation: the underlying parametric face model provides only a coarse estimate of the face shape, limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture that takes advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start by reconstructing a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then take the encoder from a pre-trained monocular face reconstruction method, substitute its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. The result is a trained encoder capable of efficiently regressing pose and expression parameters in real time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and higher-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model over state-of-the-art baselines and demonstrate its generalization to unseen poses, expressions, and lighting.
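Stage two can be read as: freeze the personalized avatar, plug it in as the decoder, and fine-tune the pre-trained encoder with a self-supervised reconstruction objective on the subject's videos. The sketch below assumes a simple photometric L1 loss and generic `encoder`/`personalized_decoder` modules; the paper's image formation model and objective are richer.

```python
# Hypothetical sketch of the transfer-learning stage, assuming a photometric
# L1 objective; not the paper's exact image formation model or losses.
import torch

def finetune_encoder(encoder, personalized_decoder, frames, steps=1000, lr=1e-4):
    for p in personalized_decoder.parameters():
        p.requires_grad_(False)               # the personalized avatar stays fixed
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for step in range(steps):
        img = frames[step % len(frames)]      # one video frame per step
        params = encoder(img)                 # regressed pose/expression parameters
        rendered = personalized_decoder(params)
        loss = (rendered - img).abs().mean()  # self-supervised photometric loss
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```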
{"title":"SPARK: Self-supervised Personalized Real-time Monocular Face Capture","authors":"Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma","doi":"arxiv-2409.07984","DOIUrl":"https://doi.org/arxiv-2409.07984","url":null,"abstract":"Feedforward monocular face capture methods seek to reconstruct posed faces\u0000from a single image of a person. Current state of the art approaches have the\u0000ability to regress parametric 3D face models in real-time across a wide range\u0000of identities, lighting conditions and poses by leveraging large image datasets\u0000of human faces. These methods however suffer from clear limitations in that the\u0000underlying parametric face model only provides a coarse estimation of the face\u0000shape, thereby limiting their practical applicability in tasks that require\u0000precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this\u0000paper, we propose a method for high-precision 3D face capture taking advantage\u0000of a collection of unconstrained videos of a subject as prior information. Our\u0000proposal builds on a two stage approach. We start with the reconstruction of a\u0000detailed 3D face avatar of the person, capturing both precise geometry and\u0000appearance from a collection of videos. We then use the encoder from a\u0000pre-trained monocular face reconstruction method, substituting its decoder with\u0000our personalized model, and proceed with transfer learning on the video\u0000collection. Using our pre-estimated image formation model, we obtain a more\u0000precise self-supervision objective, enabling improved expression and pose\u0000alignment. This results in a trained encoder capable of efficiently regressing\u0000pose and expression parameters in real-time from previously unseen images,\u0000which combined with our personalized geometry model yields more accurate and\u0000high fidelity mesh inference. Through extensive qualitative and quantitative\u0000evaluation, we showcase the superiority of our final model as compared to\u0000state-of-the-art baselines, and demonstrate its generalization ability to\u0000unseen pose, expression and lighting.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking
Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin
Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using appearance cues, either through online learning techniques to improve adaptivity or through offline learning techniques that exploit temporal information from videos. However, most existing online learning-based MOT methods cannot learn from all past tracking information to improve adaptivity on long-term occlusions while maintaining real-time tracking speed. On the other hand, temporal information-based offline learning methods maintain a long-term memory to store past tracking information, but this approach restricts them to using only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we introduce a two-stage association module specifically designed for the proposed continual learning-based tracking. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art online tracking performance on the MOT17 and MOT20 benchmarks. The code will be released upon acceptance.
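To make the "learn from all past tracking information, online" objective concrete, the toy sketch below refits a lightweight appearance head on every stored past feature at each frame. This naive version grows linearly in compute per frame; FACT's actual FAC module is designed to deliver the same adaptivity while keeping real-time speed, so treat this purely as an illustration of the objective, not of the method.

```python
# Naive stand-in: an appearance head trained online on ALL past features.
# FACT's FAC module achieves this adaptivity without the linear compute growth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineAppearanceHead(nn.Module):
    def __init__(self, dim=128, max_ids=64, lr=0.1):
        super().__init__()
        self.head = nn.Linear(dim, max_ids, bias=False)
        self.opt = torch.optim.SGD(self.head.parameters(), lr=lr)
        self.feats, self.ids = [], []             # all past tracking information

    def update(self, frame_feats, frame_ids):
        self.feats.append(frame_feats)
        self.ids.append(frame_ids)
        x, y = torch.cat(self.feats), torch.cat(self.ids)
        loss = F.cross_entropy(self.head(x), y)   # fit identities on full history
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        return loss.item()

tracker = OnlineAppearanceHead()
tracker.update(torch.randn(3, 128), torch.tensor([0, 1, 2]))
```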
{"title":"FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking","authors":"Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin","doi":"arxiv-2409.07904","DOIUrl":"https://doi.org/arxiv-2409.07904","url":null,"abstract":"Multiple object tracking (MOT) involves identifying multiple targets and\u0000assigning them corresponding IDs within a video sequence, where occlusions are\u0000often encountered. Recent methods address occlusions using appearance cues\u0000through online learning techniques to improve adaptivity or offline learning\u0000techniques to utilize temporal information from videos. However, most existing\u0000online learning-based MOT methods are unable to learn from all past tracking\u0000information to improve adaptivity on long-term occlusions while maintaining\u0000real-time tracking speed. On the other hand, temporal information-based offline\u0000learning methods maintain a long-term memory to store past tracking\u0000information, but this approach restricts them to use only local past\u0000information during tracking. To address these challenges, we propose a new MOT\u0000framework called the Feature Adaptive Continual-learning Tracker (FACT), which\u0000enables real-time tracking and feature learning for targets by utilizing all\u0000past tracking information. We demonstrate that the framework can be integrated\u0000with various state-of-the-art feature-based trackers, thereby improving their\u0000tracking ability. Specifically, we develop the feature adaptive\u0000continual-learning (FAC) module, a neural network that can be trained online to\u0000learn features adaptively using all past tracking information during tracking.\u0000Moreover, we also introduce a two-stage association module specifically\u0000designed for the proposed continual learning-based tracking. Extensive\u0000experiment results demonstrate that the proposed method achieves\u0000state-of-the-art online tracking performance on MOT17 and MOT20 benchmarks. The\u0000code will be released upon acceptance.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Expansive Supervision for Neural Radiance Field
Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang
Neural Radiance Fields have achieved success in creating powerful 3D media representations with their exceptional reconstruction capabilities. However, the computational demands of volume rendering pose significant challenges during model training. Existing acceleration techniques often involve redesigning the model architecture, leading to limitations in compatibility across different frameworks. Furthermore, these methods tend to overlook the substantial memory costs incurred. In response to these challenges, we introduce an expansive supervision mechanism that efficiently balances computational load, rendering quality, and flexibility for neural radiance field training. This mechanism operates by selectively rendering a small but crucial subset of pixels and expanding their values to estimate the error across the entire area at each iteration. Compared to conventional supervision, our method effectively bypasses redundant rendering processes, resulting in notable reductions in both time and memory consumption. Experimental results demonstrate that integrating expansive supervision within existing state-of-the-art acceleration frameworks achieves 69% memory savings and 42% time savings, with negligible compromise in visual quality.
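The core trick, rendering only a subset of pixels and letting its error stand in for the whole image, can be sketched as follows. Uniform random ray selection and a plain mean-squared estimate are assumptions here; the paper's selection of the "crucial" subset and its value-expansion scheme are more elaborate.

```python
# Minimal sketch: supervise NeRF training from a rendered subset of rays,
# using the subset's mean error as an estimate of the full-image error.
# Uniform sampling here is an assumption, not the paper's selection scheme.
import torch

def expansive_loss(render_fn, rays, targets, frac=0.25):
    n = rays.shape[0]
    idx = torch.randperm(n)[: max(1, int(n * frac))]  # small but crucial subset
    pred = render_fn(rays[idx])                       # render only these rays
    return (pred - targets[idx]).pow(2).mean()        # expanded error estimate
```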
{"title":"Expansive Supervision for Neural Radiance Field","authors":"Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang","doi":"arxiv-2409.08056","DOIUrl":"https://doi.org/arxiv-2409.08056","url":null,"abstract":"Neural Radiance Fields have achieved success in creating powerful 3D media\u0000representations with their exceptional reconstruction capabilities. However,\u0000the computational demands of volume rendering pose significant challenges\u0000during model training. Existing acceleration techniques often involve\u0000redesigning the model architecture, leading to limitations in compatibility\u0000across different frameworks. Furthermore, these methods tend to overlook the\u0000substantial memory costs incurred. In response to these challenges, we\u0000introduce an expansive supervision mechanism that efficiently balances\u0000computational load, rendering quality and flexibility for neural radiance field\u0000training. This mechanism operates by selectively rendering a small but crucial\u0000subset of pixels and expanding their values to estimate the error across the\u0000entire area for each iteration. Compare to conventional supervision, our method\u0000effectively bypasses redundant rendering processes, resulting in notable\u0000reductions in both time and memory consumption. Experimental results\u0000demonstrate that integrating expansive supervision within existing\u0000state-of-the-art acceleration frameworks can achieve 69% memory savings and 42%\u0000time savings, with negligible compromise in visual quality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak
Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic: given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion-controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and the emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground-truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method to incorporate a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).
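The non-determinism comes from sampling discrete motion codes rather than regressing a single output. Below is a generic VQ-VAE quantization step with a straight-through estimator, standing in for the paper's two-stage model; the codebook size, latent dimensions, and 0.25 commitment weight are placeholder choices.

```python
# Generic VQ-VAE quantization with a straight-through estimator; all sizes
# and loss weights are placeholders, not ProbTalk3D's actual configuration.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codes=512, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codes, dim)

    def forward(self, z):                                # z: (B, T, dim) latents
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)  # (B*T, codes)
        idx = dists.argmin(-1).view(z.shape[:-1])        # nearest code per latent
        zq = self.codebook(idx)
        # codebook + commitment terms (standard VQ-VAE objective)
        vq_loss = (zq - z.detach()).pow(2).mean() + 0.25 * (z - zq.detach()).pow(2).mean()
        zq = z + (zq - z).detach()                       # straight-through gradient
        return zq, idx, vq_loss

zq, idx, loss = VectorQuantizer()(torch.randn(2, 30, 128))
```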
{"title":"ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE","authors":"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak","doi":"arxiv-2409.07966","DOIUrl":"https://doi.org/arxiv-2409.07966","url":null,"abstract":"Audio-driven 3D facial animation synthesis has been an active field of\u0000research with attention from both academia and industry. While there are\u0000promising results in this area, recent approaches largely focus on lip-sync and\u0000identity control, neglecting the role of emotions and emotion control in the\u0000generative process. That is mainly due to the lack of emotionally rich facial\u0000animation data and algorithms that can synthesize speech animations with\u0000emotional expressions at the same time. In addition, majority of the models are\u0000deterministic, meaning given the same audio input, they produce the same output\u0000motion. We argue that emotions and non-determinism are crucial to generate\u0000diverse and emotionally-rich facial animations. In this paper, we propose\u0000ProbTalk3D a non-deterministic neural network approach for emotion controllable\u0000speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\u0000an emotionally rich facial animation dataset 3DMEAD. We provide an extensive\u0000comparative analysis of our model against the recent 3D facial animation\u0000synthesis approaches, by evaluating the results objectively, qualitatively, and\u0000with a perceptual user study. We highlight several objective metrics that are\u0000more suitable for evaluating stochastic outputs and use both in-the-wild and\u0000ground truth data for subjective evaluation. To our knowledge, that is the\u0000first non-deterministic 3D facial animation synthesis method incorporating a\u0000rich emotion dataset and emotion control with emotion labels and intensity\u0000levels. Our evaluation demonstrates that the proposed model achieves superior\u0000performance compared to state-of-the-art emotion-controlled, deterministic and\u0000non-deterministic models. We recommend watching the supplementary video for\u0000quality judgement. The entire codebase is publicly available\u0000(https://github.com/uuembodiedsocialai/ProbTalk3D/).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian Self-Training for Semi-Supervised 3D Segmentation
Ozan Unal, Christos Sakaridis, Luc Van Gool
3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor-intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at ouenal.github.io/bst/.
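A common way to realize "stochastic inference plus point-wise uncertainty filtering" is Monte Carlo dropout: run several stochastic forward passes, average the predictions, and keep only low-entropy points as pseudo-labels. The sketch below takes that reading; the number of passes, the entropy threshold, and MC dropout itself are assumptions about the instantiation, not the paper's exact recipe.

```python
# Sketch of uncertainty-filtered pseudo-labeling via MC-dropout-style
# stochastic inference; passes/threshold are assumed hyperparameters.
import torch

@torch.no_grad()
def make_pseudo_labels(model, points, passes=8, max_entropy=0.5):
    model.train()                         # keep dropout active at inference time
    probs = torch.stack([model(points).softmax(-1) for _ in range(passes)])
    mean = probs.mean(0)                  # (N, C) averaged class probabilities
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(-1)  # point-wise uncertainty
    labels = mean.argmax(-1)
    keep = entropy < max_entropy          # filter out uncertain points
    return labels, keep                   # caller trains on points[keep], labels[keep]
```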
{"title":"Bayesian Self-Training for Semi-Supervised 3D Segmentation","authors":"Ozan Unal, Christos Sakaridis, Luc Van Gool","doi":"arxiv-2409.08102","DOIUrl":"https://doi.org/arxiv-2409.08102","url":null,"abstract":"3D segmentation is a core problem in computer vision and, similarly to many\u0000other dense prediction tasks, it requires large amounts of annotated data for\u0000adequate training. However, densely labeling 3D point clouds to employ\u0000fully-supervised training remains too labor intensive and expensive.\u0000Semi-supervised training provides a more practical alternative, where only a\u0000small set of labeled data is given, accompanied by a larger unlabeled set. This\u0000area thus studies the effective use of unlabeled data to reduce the performance\u0000gap that arises due to the lack of annotations. In this work, inspired by\u0000Bayesian deep learning, we first propose a Bayesian self-training framework for\u0000semi-supervised 3D semantic segmentation. Employing stochastic inference, we\u0000generate an initial set of pseudo-labels and then filter these based on\u0000estimated point-wise uncertainty. By constructing a heuristic $n$-partite\u0000matching algorithm, we extend the method to semi-supervised 3D instance\u0000segmentation, and finally, with the same building blocks, to dense 3D visual\u0000grounding. We demonstrate state-of-the-art results for our semi-supervised\u0000method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on\u0000ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial\u0000improvements in dense 3D visual grounding over supervised-only baselines on\u0000ScanRefer. Our project page is available at ouenal.github.io/bst/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia
High frame rate and accurate depth estimation plays an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification achieved by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction, by significantly reducing the streaming requirements on the depth sensor thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.
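The three named stages suggest a simple module skeleton: encode both modalities, iteratively integrate the sparse depth features with the dense RGB features, then decode dense depth. The skeleton below only mirrors that structure; every module body and the iteration count are assumed stubs, not the paper's architecture.

```python
# Skeleton mirroring DoD's three stated stages; all module internals and the
# number of integration iterations are assumptions.
import torch.nn as nn

class DepthOnDemand(nn.Module):
    def __init__(self, enc_rgb, enc_depth, integrator, decoder, iters=3):
        super().__init__()
        self.enc_rgb, self.enc_depth = enc_rgb, enc_depth
        self.integrator, self.decoder = integrator, decoder
        self.iters = iters

    def forward(self, rgb, sparse_depth):
        f_rgb = self.enc_rgb(rgb)            # i) multi-modal encoding
        f_d = self.enc_depth(sparse_depth)
        for _ in range(self.iters):          # ii) iterative multi-modal integration
            f_d = self.integrator(f_rgb, f_d)
        return self.decoder(f_d)             # iii) depth decoding
```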
{"title":"Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor","authors":"Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia","doi":"arxiv-2409.08277","DOIUrl":"https://doi.org/arxiv-2409.08277","url":null,"abstract":"High frame rate and accurate depth estimation plays an important role in\u0000several tasks crucial to robotics and automotive perception. To date, this can\u0000be achieved through ToF and LiDAR devices for indoor and outdoor applications,\u0000respectively. However, their applicability is limited by low frame rate, energy\u0000consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate\u0000temporal and spatial depth densification achieved by exploiting a high frame\u0000rate RGB sensor coupled with a potentially lower frame rate and sparse active\u0000depth sensor. Our proposal jointly enables lower energy consumption and denser\u0000shape reconstruction, by significantly reducing the streaming requirements on\u0000the depth sensor thanks to its three core stages: i) multi-modal encoding, ii)\u0000iterative multi-modal integration, and iii) depth decoding. We present extended\u0000evidence assessing the effectiveness of DoD on indoor and outdoor video\u0000datasets, covering both environment scanning and automotive perception use\u0000cases.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"174 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
Junsung Lee, Minsoo Kang, Bohyung Han
We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions; one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction network is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
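The described combination transcribes almost directly into code. In the sketch below, `eps` is the pretrained noise prediction network conditioned on a prompt embedding; the interpolation weight `alpha`, the scale `lam`, and the use of the source-prompt prediction as the standard denoising term are assumptions about the exact formulation.

```python
# Sketch of the corrected noise prediction; alpha/lam schedules and the exact
# "standard" term are assumptions, not the paper's precise formulation.
def corrected_eps(eps, x_t, t, c_src, c_tgt, alpha=0.5, lam=1.0):
    eps_src = eps(x_t, t, c_src)                   # source-prompt prediction
    c_mix = (1.0 - alpha) * c_src + alpha * c_tgt  # progressive prompt interpolation
    correction = eps(x_t, t, c_mix) - eps_src      # noise correction term
    # Linear combination: the standard term reconstructs must-be-preserved
    # regions; the correction edits regions relevant to the target prompt.
    return eps_src + lam * correction
```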
{"title":"Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation","authors":"Junsung Lee, Minsoo Kang, Bohyung Han","doi":"arxiv-2409.08077","DOIUrl":"https://doi.org/arxiv-2409.08077","url":null,"abstract":"We propose a simple but effective training-free approach tailored to\u0000diffusion-based image-to-image translation. Our approach revises the original\u0000noise prediction network of a pretrained diffusion model by introducing a noise\u0000correction term. We formulate the noise correction term as the difference\u0000between two noise predictions; one is computed from the denoising network with\u0000a progressive interpolation of the source and target prompt embeddings, while\u0000the other is the noise prediction with the source prompt embedding. The final\u0000noise prediction network is given by a linear combination of the standard\u0000denoising term and the noise correction term, where the former is designed to\u0000reconstruct must-be-preserved regions while the latter aims to effectively edit\u0000regions of interest relevant to the target prompt. Our approach can be easily\u0000incorporated into existing image-to-image translation methods based on\u0000diffusion models. Extensive experiments verify that the proposed technique\u0000achieves outstanding performance with low latency and consistently improves\u0000existing frameworks when combined with them.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction
Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang
The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, a grid representing features within a certain height range usually picks up many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module enhances the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at https://github.com/yanzq95/DHD.
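The decoupling step itself is easy to picture: threshold a predicted height map into one binary mask per height interval, then use each mask to route 2D features into its own subspace. The toy below uses fixed, hand-picked bins; DHD derives its bins from height distribution statistics and learns the height maps under explicit supervision.

```python
# Toy mask-guided height decoupling with hand-picked bins; DHD's bins come
# from dataset height statistics and supervised height-map prediction.
import torch

def height_masks(height_map, bins=((-2.0, 0.5), (0.5, 2.0), (2.0, 6.0))):
    # height_map: (B, H, W) metric heights -> one binary mask per height bin
    return [((height_map >= lo) & (height_map < hi)).float() for lo, hi in bins]

masks = height_masks(torch.randn(2, 200, 200) * 3.0)
print([m.mean().item() for m in masks])  # fraction of pixels falling in each bin
```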
{"title":"Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction","authors":"Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang","doi":"arxiv-2409.07972","DOIUrl":"https://doi.org/arxiv-2409.07972","url":null,"abstract":"The task of vision-based 3D occupancy prediction aims to reconstruct 3D\u0000geometry and estimate its semantic classes from 2D color images, where the\u00002D-to-3D view transformation is an indispensable step. Most previous methods\u0000conduct forward projection, such as BEVPooling and VoxelPooling, both of which\u0000map the 2D image features into 3D grids. However, the current grid representing\u0000features within a certain height range usually introduces many confusing\u0000features that belong to other height ranges. To address this challenge, we\u0000present Deep Height Decoupling (DHD), a novel framework that incorporates\u0000explicit height prior to filter out the confusing features. Specifically, DHD\u0000first predicts height maps via explicit supervision. Based on the height\u0000distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to\u0000adaptively decoupled the height map into multiple binary masks. MGHS projects\u0000the 2D image features into multiple subspaces, where each grid contains\u0000features within reasonable height ranges. Finally, a Synergistic Feature\u0000Aggregation (SFA) module is deployed to enhance the feature representation\u0000through channel and spatial affinities, enabling further occupancy refinement.\u0000On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art\u0000performance even with minimal input frames. Code is available at\u0000https://github.com/yanzq95/DHD.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}