What Makes a Maze Look Like a Maze?
Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu
A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language models to extract schemas, then hierarchically grounds concrete to abstract components of the schema onto images with vision-language models. The grounded schema is used to augment visual abstraction understanding. We systematically evaluate DSG and different methods in reasoning on our new Visual Abstractions Dataset, which consists of diverse, real-world images of abstract concepts and corresponding question-answer pairs labeled by humans. We show that DSG significantly improves the abstract visual reasoning performance of vision-language models, and is a step toward human-aligned understanding of visual abstractions.
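As a rough illustration of the hierarchical grounding idea, the sketch below represents a schema as a dependency graph and grounds components in topological order, so concrete symbols are resolved before the abstract ones that depend on them. The `Schema` fields and the `vlm_ground` call are hypothetical stand-ins, not DSG's actual interface.

```python
# A minimal sketch of schema-based hierarchical grounding, assuming a
# hypothetical vlm_ground() query; DSG's actual prompts/models are not shown.
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

@dataclass
class Schema:
    concept: str                                    # e.g., "maze"
    components: dict = field(default_factory=dict)  # symbol -> text description
    depends_on: dict = field(default_factory=dict)  # symbol -> prerequisite symbols
                                                    # (every symbol must appear as a key)

def vlm_ground(image, description, context):
    """Hypothetical VLM call: localize `description` in `image`, conditioned
    on the groundings of prerequisite components in `context`."""
    raise NotImplementedError

def ground_schema(image, schema: Schema):
    groundings = {}
    # Topological order guarantees concrete components are grounded before
    # the abstract components that reference them.
    for symbol in TopologicalSorter(schema.depends_on).static_order():
        context = {d: groundings[d] for d in schema.depends_on[symbol]}
        groundings[symbol] = vlm_ground(image, schema.components[symbol], context)
    return groundings

maze = Schema(
    concept="maze",
    components={"walls": "barriers arranged in a grid", "paths": "gaps between walls"},
    depends_on={"walls": [], "paths": ["walls"]},
)
```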
{"title":"What Makes a Maze Look Like a Maze?","authors":"Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu","doi":"arxiv-2409.08202","DOIUrl":"https://doi.org/arxiv-2409.08202","url":null,"abstract":"A unique aspect of human visual understanding is the ability to flexibly\u0000interpret abstract concepts: acquiring lifted rules explaining what they\u0000symbolize, grounding them across familiar and unfamiliar contexts, and making\u0000predictions or reasoning about them. While off-the-shelf vision-language models\u0000excel at making literal interpretations of images (e.g., recognizing object\u0000categories such as tree branches), they still struggle to make sense of such\u0000visual abstractions (e.g., how an arrangement of tree branches may form the\u0000walls of a maze). To address this challenge, we introduce Deep Schema Grounding\u0000(DSG), a framework that leverages explicit structured representations of visual\u0000abstractions for grounding and reasoning. At the core of DSG are\u0000schemas--dependency graph descriptions of abstract concepts that decompose them\u0000into more primitive-level symbols. DSG uses large language models to extract\u0000schemas, then hierarchically grounds concrete to abstract components of the\u0000schema onto images with vision-language models. The grounded schema is used to\u0000augment visual abstraction understanding. We systematically evaluate DSG and\u0000different methods in reasoning on our new Visual Abstractions Dataset, which\u0000consists of diverse, real-world images of abstract concepts and corresponding\u0000question-answer pairs labeled by humans. We show that DSG significantly\u0000improves the abstract visual reasoning performance of vision-language models,\u0000and is a step toward human-aligned understanding of visual abstractions.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"28 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding
Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu
Panoptic narrative grounding (PNG), whose core target is fine-grained image-text alignment, requires a panoptic segmentation of referred objects given a narrative caption. Previous discriminative methods achieve only weak or coarse-grained alignment through panoptic segmentation pretraining or CLIP model adaptation. Given the recent progress of text-to-image Diffusion models, several works have shown their capability to achieve fine-grained image-text alignment through cross-attention maps, along with improved general segmentation performance. However, directly applying frozen Diffusion models to the PNG task with phrase features as static prompts still suffers from a large task gap and insufficient vision-language interaction, yielding inferior performance. We therefore propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet that dynamically updates phrase prompts with image features and injects the multimodal cues back, exploiting the fine-grained image-text alignment capability of Diffusion models more fully. In addition, we design a Multi-Level Mutual Aggregation (MLMA) module that reciprocally fuses multi-level image and phrase features for segmentation refinement. Extensive experiments on the PNG benchmark show that our method achieves new state-of-the-art performance.
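The extractive-injective pattern can be pictured as two cross-attention passes: phrase prompts first query image features to update themselves, then the updated prompts are injected back into the image features. The sketch below is a generic reading of that pattern with assumed shapes and module placement, not the paper's exact EIPA design inside the Diffusion UNet.

```python
# A generic two-way cross-attention sketch of the extractive-injective idea;
# dimensions and placement inside the UNet are assumptions.
import torch
import torch.nn as nn

class PhraseAdapter(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.extract = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inject = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phrases, image_tokens):
        # Extractive pass: phrase prompts attend to image features and update.
        upd, _ = self.extract(phrases, image_tokens, image_tokens)
        phrases = phrases + upd
        # Injective pass: image tokens absorb the multimodal phrase cues.
        inj, _ = self.inject(image_tokens, phrases, phrases)
        return phrases, image_tokens + inj

adapter = PhraseAdapter()
phrases, tokens = adapter(torch.randn(2, 5, 256), torch.randn(2, 4096, 256))
```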
{"title":"Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding","authors":"Hongyu Li, Tianrui Hui, Zihan Ding, Jing Zhang, Bin Ma, Xiaoming Wei, Jizhong Han, Si Liu","doi":"arxiv-2409.08251","DOIUrl":"https://doi.org/arxiv-2409.08251","url":null,"abstract":"Panoptic narrative grounding (PNG), whose core target is fine-grained\u0000image-text alignment, requires a panoptic segmentation of referred objects\u0000given a narrative caption. Previous discriminative methods achieve only weak or\u0000coarse-grained alignment by panoptic segmentation pretraining or CLIP model\u0000adaptation. Given the recent progress of text-to-image Diffusion models,\u0000several works have shown their capability to achieve fine-grained image-text\u0000alignment through cross-attention maps and improved general segmentation\u0000performance. However, the direct use of phrase features as static prompts to\u0000apply frozen Diffusion models to the PNG task still suffers from a large task\u0000gap and insufficient vision-language interaction, yielding inferior\u0000performance. Therefore, we propose an Extractive-Injective Phrase Adapter\u0000(EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts\u0000with image features and inject the multimodal cues back, which leverages the\u0000fine-grained image-text alignment capability of Diffusion models more\u0000sufficiently. In addition, we also design a Multi-Level Mutual Aggregation\u0000(MLMA) module to reciprocally fuse multi-level image and phrase features for\u0000segmentation refinement. Extensive experiments on the PNG benchmark show that\u0000our method achieves new state-of-the-art performance.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SPARK: Self-supervised Personalized Real-time Monocular Face Capture
Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches can regress parametric 3D face models in real time across a wide range of identities, lighting conditions, and poses by leveraging large image datasets of human faces. These methods, however, suffer from a clear limitation: the underlying parametric face model provides only a coarse estimate of the face shape, limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture that takes advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start by reconstructing a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then take the encoder from a pre-trained monocular face reconstruction method, substitute its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. The result is a trained encoder capable of efficiently regressing pose and expression parameters in real time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and higher-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model over state-of-the-art baselines and demonstrate its generalization to unseen poses, expressions, and lighting.
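Stage two can be read as: freeze the personalized avatar, plug it in as the decoder, and fine-tune the pre-trained encoder with a self-supervised reconstruction objective on the subject's videos. The sketch below assumes a simple photometric L1 loss and generic `encoder`/`personalized_decoder` modules; the paper's image formation model and objective are richer.

```python
# Hypothetical sketch of the transfer-learning stage, assuming a photometric
# L1 objective; not the paper's exact image formation model or losses.
import torch

def finetune_encoder(encoder, personalized_decoder, frames, steps=1000, lr=1e-4):
    for p in personalized_decoder.parameters():
        p.requires_grad_(False)               # the personalized avatar stays fixed
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for step in range(steps):
        img = frames[step % len(frames)]      # one video frame per step
        params = encoder(img)                 # regressed pose/expression parameters
        rendered = personalized_decoder(params)
        loss = (rendered - img).abs().mean()  # self-supervised photometric loss
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```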
{"title":"SPARK: Self-supervised Personalized Real-time Monocular Face Capture","authors":"Kelian Baert, Shrisha Bharadwaj, Fabien Castan, Benoit Maujean, Marc Christie, Victoria Abrevaya, Adnane Boukhayma","doi":"arxiv-2409.07984","DOIUrl":"https://doi.org/arxiv-2409.07984","url":null,"abstract":"Feedforward monocular face capture methods seek to reconstruct posed faces\u0000from a single image of a person. Current state of the art approaches have the\u0000ability to regress parametric 3D face models in real-time across a wide range\u0000of identities, lighting conditions and poses by leveraging large image datasets\u0000of human faces. These methods however suffer from clear limitations in that the\u0000underlying parametric face model only provides a coarse estimation of the face\u0000shape, thereby limiting their practical applicability in tasks that require\u0000precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this\u0000paper, we propose a method for high-precision 3D face capture taking advantage\u0000of a collection of unconstrained videos of a subject as prior information. Our\u0000proposal builds on a two stage approach. We start with the reconstruction of a\u0000detailed 3D face avatar of the person, capturing both precise geometry and\u0000appearance from a collection of videos. We then use the encoder from a\u0000pre-trained monocular face reconstruction method, substituting its decoder with\u0000our personalized model, and proceed with transfer learning on the video\u0000collection. Using our pre-estimated image formation model, we obtain a more\u0000precise self-supervision objective, enabling improved expression and pose\u0000alignment. This results in a trained encoder capable of efficiently regressing\u0000pose and expression parameters in real-time from previously unseen images,\u0000which combined with our personalized geometry model yields more accurate and\u0000high fidelity mesh inference. Through extensive qualitative and quantitative\u0000evaluation, we showcase the superiority of our final model as compared to\u0000state-of-the-art baselines, and demonstrate its generalization ability to\u0000unseen pose, expression and lighting.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking
Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin
Multiple object tracking (MOT) involves identifying multiple targets and assigning them corresponding IDs within a video sequence, where occlusions are often encountered. Recent methods address occlusions using appearance cues, either through online learning techniques to improve adaptivity or through offline learning techniques that exploit temporal information from videos. However, most existing online learning-based MOT methods cannot learn from all past tracking information to improve adaptivity on long-term occlusions while maintaining real-time tracking speed. On the other hand, temporal information-based offline learning methods maintain a long-term memory to store past tracking information, but this approach restricts them to using only local past information during tracking. To address these challenges, we propose a new MOT framework called the Feature Adaptive Continual-learning Tracker (FACT), which enables real-time tracking and feature learning for targets by utilizing all past tracking information. We demonstrate that the framework can be integrated with various state-of-the-art feature-based trackers, thereby improving their tracking ability. Specifically, we develop the feature adaptive continual-learning (FAC) module, a neural network that can be trained online to learn features adaptively using all past tracking information during tracking. Moreover, we introduce a two-stage association module specifically designed for the proposed continual learning-based tracking. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art online tracking performance on the MOT17 and MOT20 benchmarks. The code will be released upon acceptance.
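To make the "learn from all past tracking information, online" objective concrete, the toy sketch below refits a lightweight appearance head on every stored past feature at each frame. This naive version grows linearly in compute per frame; FACT's actual FAC module is designed to deliver the same adaptivity while keeping real-time speed, so treat this purely as an illustration of the objective, not of the method.

```python
# Naive stand-in: an appearance head trained online on ALL past features.
# FACT's FAC module achieves this adaptivity without the linear compute growth.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineAppearanceHead(nn.Module):
    def __init__(self, dim=128, max_ids=64, lr=0.1):
        super().__init__()
        self.head = nn.Linear(dim, max_ids, bias=False)
        self.opt = torch.optim.SGD(self.head.parameters(), lr=lr)
        self.feats, self.ids = [], []             # all past tracking information

    def update(self, frame_feats, frame_ids):
        self.feats.append(frame_feats)
        self.ids.append(frame_ids)
        x, y = torch.cat(self.feats), torch.cat(self.ids)
        loss = F.cross_entropy(self.head(x), y)   # fit identities on full history
        self.opt.zero_grad(); loss.backward(); self.opt.step()
        return loss.item()

tracker = OnlineAppearanceHead()
tracker.update(torch.randn(3, 128), torch.tensor([0, 1, 2]))
```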
{"title":"FACT: Feature Adaptive Continual-learning Tracker for Multiple Object Tracking","authors":"Rongzihan Song, Zhenyu Weng, Huiping Zhuang, Jinchang Ren, Yongming Chen, Zhiping Lin","doi":"arxiv-2409.07904","DOIUrl":"https://doi.org/arxiv-2409.07904","url":null,"abstract":"Multiple object tracking (MOT) involves identifying multiple targets and\u0000assigning them corresponding IDs within a video sequence, where occlusions are\u0000often encountered. Recent methods address occlusions using appearance cues\u0000through online learning techniques to improve adaptivity or offline learning\u0000techniques to utilize temporal information from videos. However, most existing\u0000online learning-based MOT methods are unable to learn from all past tracking\u0000information to improve adaptivity on long-term occlusions while maintaining\u0000real-time tracking speed. On the other hand, temporal information-based offline\u0000learning methods maintain a long-term memory to store past tracking\u0000information, but this approach restricts them to use only local past\u0000information during tracking. To address these challenges, we propose a new MOT\u0000framework called the Feature Adaptive Continual-learning Tracker (FACT), which\u0000enables real-time tracking and feature learning for targets by utilizing all\u0000past tracking information. We demonstrate that the framework can be integrated\u0000with various state-of-the-art feature-based trackers, thereby improving their\u0000tracking ability. Specifically, we develop the feature adaptive\u0000continual-learning (FAC) module, a neural network that can be trained online to\u0000learn features adaptively using all past tracking information during tracking.\u0000Moreover, we also introduce a two-stage association module specifically\u0000designed for the proposed continual learning-based tracking. Extensive\u0000experiment results demonstrate that the proposed method achieves\u0000state-of-the-art online tracking performance on MOT17 and MOT20 benchmarks. The\u0000code will be released upon acceptance.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"87 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Expansive Supervision for Neural Radiance Field
Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang
Neural Radiance Fields have achieved success in creating powerful 3D media representations with their exceptional reconstruction capabilities. However, the computational demands of volume rendering pose significant challenges during model training. Existing acceleration techniques often involve redesigning the model architecture, leading to limitations in compatibility across different frameworks. Furthermore, these methods tend to overlook the substantial memory costs incurred. In response to these challenges, we introduce an expansive supervision mechanism that efficiently balances computational load, rendering quality, and flexibility for neural radiance field training. This mechanism operates by selectively rendering a small but crucial subset of pixels and expanding their values to estimate the error across the entire area at each iteration. Compared to conventional supervision, our method effectively bypasses redundant rendering processes, resulting in notable reductions in both time and memory consumption. Experimental results demonstrate that integrating expansive supervision within existing state-of-the-art acceleration frameworks achieves 69% memory savings and 42% time savings, with negligible compromise in visual quality.
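The core trick, rendering only a subset of pixels and letting its error stand in for the whole image, can be sketched as follows. Uniform random ray selection and a plain mean-squared estimate are assumptions here; the paper's selection of the "crucial" subset and its value-expansion scheme are more elaborate.

```python
# Minimal sketch: supervise NeRF training from a rendered subset of rays,
# using the subset's mean error as an estimate of the full-image error.
# Uniform sampling here is an assumption, not the paper's selection scheme.
import torch

def expansive_loss(render_fn, rays, targets, frac=0.25):
    n = rays.shape[0]
    idx = torch.randperm(n)[: max(1, int(n * frac))]  # small but crucial subset
    pred = render_fn(rays[idx])                       # render only these rays
    return (pred - targets[idx]).pow(2).mean()        # expanded error estimate
```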
{"title":"Expansive Supervision for Neural Radiance Field","authors":"Weixiang Zhang, Shuzhao Xie, Shijia Ge, Wei Yao, Chen Tang, Zhi Wang","doi":"arxiv-2409.08056","DOIUrl":"https://doi.org/arxiv-2409.08056","url":null,"abstract":"Neural Radiance Fields have achieved success in creating powerful 3D media\u0000representations with their exceptional reconstruction capabilities. However,\u0000the computational demands of volume rendering pose significant challenges\u0000during model training. Existing acceleration techniques often involve\u0000redesigning the model architecture, leading to limitations in compatibility\u0000across different frameworks. Furthermore, these methods tend to overlook the\u0000substantial memory costs incurred. In response to these challenges, we\u0000introduce an expansive supervision mechanism that efficiently balances\u0000computational load, rendering quality and flexibility for neural radiance field\u0000training. This mechanism operates by selectively rendering a small but crucial\u0000subset of pixels and expanding their values to estimate the error across the\u0000entire area for each iteration. Compare to conventional supervision, our method\u0000effectively bypasses redundant rendering processes, resulting in notable\u0000reductions in both time and memory consumption. Experimental results\u0000demonstrate that integrating expansive supervision within existing\u0000state-of-the-art acceleration frameworks can achieve 69% memory savings and 42%\u0000time savings, with negligible compromise in visual quality.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE
Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak
Audio-driven 3D facial animation synthesis has been an active field of research with attention from both academia and industry. While there are promising results in this area, recent approaches largely focus on lip-sync and identity control, neglecting the role of emotions and emotion control in the generative process. This is mainly due to the lack of emotionally rich facial animation data and of algorithms that can synthesize speech animations with emotional expressions at the same time. In addition, the majority of models are deterministic: given the same audio input, they produce the same output motion. We argue that emotions and non-determinism are crucial for generating diverse and emotionally rich facial animations. In this paper, we propose ProbTalk3D, a non-deterministic neural network approach for emotion-controllable speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and the emotionally rich facial animation dataset 3DMEAD. We provide an extensive comparative analysis of our model against recent 3D facial animation synthesis approaches, evaluating the results objectively, qualitatively, and with a perceptual user study. We highlight several objective metrics that are more suitable for evaluating stochastic outputs and use both in-the-wild and ground-truth data for subjective evaluation. To our knowledge, this is the first non-deterministic 3D facial animation synthesis method to incorporate a rich emotion dataset and emotion control with emotion labels and intensity levels. Our evaluation demonstrates that the proposed model achieves superior performance compared to state-of-the-art emotion-controlled, deterministic, and non-deterministic models. We recommend watching the supplementary video for quality judgement. The entire codebase is publicly available (https://github.com/uuembodiedsocialai/ProbTalk3D/).
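The non-determinism comes from sampling discrete motion codes rather than regressing a single output. Below is a generic VQ-VAE quantization step with a straight-through estimator, standing in for the paper's two-stage model; the codebook size, latent dimensions, and 0.25 commitment weight are placeholder choices.

```python
# Generic VQ-VAE quantization with a straight-through estimator; all sizes
# and loss weights are placeholders, not ProbTalk3D's actual configuration.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, codes=512, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codes, dim)

    def forward(self, z):                                # z: (B, T, dim) latents
        flat = z.reshape(-1, z.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)  # (B*T, codes)
        idx = dists.argmin(-1).view(z.shape[:-1])        # nearest code per latent
        zq = self.codebook(idx)
        # codebook + commitment terms (standard VQ-VAE objective)
        vq_loss = (zq - z.detach()).pow(2).mean() + 0.25 * (z - zq.detach()).pow(2).mean()
        zq = z + (zq - z).detach()                       # straight-through gradient
        return zq, idx, vq_loss

zq, idx, loss = VectorQuantizer()(torch.randn(2, 30, 128))
```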
{"title":"ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE","authors":"Sichun Wu, Kazi Injamamul Haque, Zerrin Yumak","doi":"arxiv-2409.07966","DOIUrl":"https://doi.org/arxiv-2409.07966","url":null,"abstract":"Audio-driven 3D facial animation synthesis has been an active field of\u0000research with attention from both academia and industry. While there are\u0000promising results in this area, recent approaches largely focus on lip-sync and\u0000identity control, neglecting the role of emotions and emotion control in the\u0000generative process. That is mainly due to the lack of emotionally rich facial\u0000animation data and algorithms that can synthesize speech animations with\u0000emotional expressions at the same time. In addition, majority of the models are\u0000deterministic, meaning given the same audio input, they produce the same output\u0000motion. We argue that emotions and non-determinism are crucial to generate\u0000diverse and emotionally-rich facial animations. In this paper, we propose\u0000ProbTalk3D a non-deterministic neural network approach for emotion controllable\u0000speech-driven 3D facial animation synthesis using a two-stage VQ-VAE model and\u0000an emotionally rich facial animation dataset 3DMEAD. We provide an extensive\u0000comparative analysis of our model against the recent 3D facial animation\u0000synthesis approaches, by evaluating the results objectively, qualitatively, and\u0000with a perceptual user study. We highlight several objective metrics that are\u0000more suitable for evaluating stochastic outputs and use both in-the-wild and\u0000ground truth data for subjective evaluation. To our knowledge, that is the\u0000first non-deterministic 3D facial animation synthesis method incorporating a\u0000rich emotion dataset and emotion control with emotion labels and intensity\u0000levels. Our evaluation demonstrates that the proposed model achieves superior\u0000performance compared to state-of-the-art emotion-controlled, deterministic and\u0000non-deterministic models. We recommend watching the supplementary video for\u0000quality judgement. The entire codebase is publicly available\u0000(https://github.com/uuembodiedsocialai/ProbTalk3D/).","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian Self-Training for Semi-Supervised 3D Segmentation
Ozan Unal, Christos Sakaridis, Luc Van Gool
3D segmentation is a core problem in computer vision and, similarly to many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds to employ fully-supervised training remains too labor-intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at ouenal.github.io/bst/.
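A common way to realize "stochastic inference plus point-wise uncertainty filtering" is Monte Carlo dropout: run several stochastic forward passes, average the predictions, and keep only low-entropy points as pseudo-labels. The sketch below takes that reading; the number of passes, the entropy threshold, and MC dropout itself are assumptions about the instantiation, not the paper's exact recipe.

```python
# Sketch of uncertainty-filtered pseudo-labeling via MC-dropout-style
# stochastic inference; passes/threshold are assumed hyperparameters.
import torch

@torch.no_grad()
def make_pseudo_labels(model, points, passes=8, max_entropy=0.5):
    model.train()                         # keep dropout active at inference time
    probs = torch.stack([model(points).softmax(-1) for _ in range(passes)])
    mean = probs.mean(0)                  # (N, C) averaged class probabilities
    entropy = -(mean * mean.clamp_min(1e-8).log()).sum(-1)  # point-wise uncertainty
    labels = mean.argmax(-1)
    keep = entropy < max_entropy          # filter out uncertain points
    return labels, keep                   # caller trains on points[keep], labels[keep]
```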
{"title":"Bayesian Self-Training for Semi-Supervised 3D Segmentation","authors":"Ozan Unal, Christos Sakaridis, Luc Van Gool","doi":"arxiv-2409.08102","DOIUrl":"https://doi.org/arxiv-2409.08102","url":null,"abstract":"3D segmentation is a core problem in computer vision and, similarly to many\u0000other dense prediction tasks, it requires large amounts of annotated data for\u0000adequate training. However, densely labeling 3D point clouds to employ\u0000fully-supervised training remains too labor intensive and expensive.\u0000Semi-supervised training provides a more practical alternative, where only a\u0000small set of labeled data is given, accompanied by a larger unlabeled set. This\u0000area thus studies the effective use of unlabeled data to reduce the performance\u0000gap that arises due to the lack of annotations. In this work, inspired by\u0000Bayesian deep learning, we first propose a Bayesian self-training framework for\u0000semi-supervised 3D semantic segmentation. Employing stochastic inference, we\u0000generate an initial set of pseudo-labels and then filter these based on\u0000estimated point-wise uncertainty. By constructing a heuristic $n$-partite\u0000matching algorithm, we extend the method to semi-supervised 3D instance\u0000segmentation, and finally, with the same building blocks, to dense 3D visual\u0000grounding. We demonstrate state-of-the-art results for our semi-supervised\u0000method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on\u0000ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial\u0000improvements in dense 3D visual grounding over supervised-only baselines on\u0000ScanRefer. Our project page is available at ouenal.github.io/bst/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor
Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia
High frame rate and accurate depth estimation plays an important role in several tasks crucial to robotics and automotive perception. To date, this can be achieved through ToF and LiDAR devices for indoor and outdoor applications, respectively. However, their applicability is limited by low frame rate, energy consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate temporal and spatial depth densification achieved by exploiting a high frame rate RGB sensor coupled with a potentially lower frame rate and sparse active depth sensor. Our proposal jointly enables lower energy consumption and denser shape reconstruction, by significantly reducing the streaming requirements on the depth sensor thanks to its three core stages: i) multi-modal encoding, ii) iterative multi-modal integration, and iii) depth decoding. We present extended evidence assessing the effectiveness of DoD on indoor and outdoor video datasets, covering both environment scanning and automotive perception use cases.
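The three named stages suggest a simple module skeleton: encode both modalities, iteratively integrate the sparse depth features with the dense RGB features, then decode dense depth. The skeleton below only mirrors that structure; every module body and the iteration count are assumed stubs, not the paper's architecture.

```python
# Skeleton mirroring DoD's three stated stages; all module internals and the
# number of integration iterations are assumptions.
import torch.nn as nn

class DepthOnDemand(nn.Module):
    def __init__(self, enc_rgb, enc_depth, integrator, decoder, iters=3):
        super().__init__()
        self.enc_rgb, self.enc_depth = enc_rgb, enc_depth
        self.integrator, self.decoder = integrator, decoder
        self.iters = iters

    def forward(self, rgb, sparse_depth):
        f_rgb = self.enc_rgb(rgb)            # i) multi-modal encoding
        f_d = self.enc_depth(sparse_depth)
        for _ in range(self.iters):          # ii) iterative multi-modal integration
            f_d = self.integrator(f_rgb, f_d)
        return self.decoder(f_d)             # iii) depth decoding
```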
{"title":"Depth on Demand: Streaming Dense Depth from a Low Frame Rate Active Sensor","authors":"Andrea Conti, Matteo Poggi, Valerio Cambareri, Stefano Mattoccia","doi":"arxiv-2409.08277","DOIUrl":"https://doi.org/arxiv-2409.08277","url":null,"abstract":"High frame rate and accurate depth estimation plays an important role in\u0000several tasks crucial to robotics and automotive perception. To date, this can\u0000be achieved through ToF and LiDAR devices for indoor and outdoor applications,\u0000respectively. However, their applicability is limited by low frame rate, energy\u0000consumption, and spatial sparsity. Depth on Demand (DoD) allows for accurate\u0000temporal and spatial depth densification achieved by exploiting a high frame\u0000rate RGB sensor coupled with a potentially lower frame rate and sparse active\u0000depth sensor. Our proposal jointly enables lower energy consumption and denser\u0000shape reconstruction, by significantly reducing the streaming requirements on\u0000the depth sensor thanks to its three core stages: i) multi-modal encoding, ii)\u0000iterative multi-modal integration, and iii) depth decoding. We present extended\u0000evidence assessing the effectiveness of DoD on indoor and outdoor video\u0000datasets, covering both environment scanning and automotive perception use\u0000cases.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"174 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation
Junsung Lee, Minsoo Kang, Bohyung Han
We propose a simple but effective training-free approach tailored to diffusion-based image-to-image translation. Our approach revises the original noise prediction network of a pretrained diffusion model by introducing a noise correction term. We formulate the noise correction term as the difference between two noise predictions; one is computed from the denoising network with a progressive interpolation of the source and target prompt embeddings, while the other is the noise prediction with the source prompt embedding. The final noise prediction network is given by a linear combination of the standard denoising term and the noise correction term, where the former is designed to reconstruct must-be-preserved regions while the latter aims to effectively edit regions of interest relevant to the target prompt. Our approach can be easily incorporated into existing image-to-image translation methods based on diffusion models. Extensive experiments verify that the proposed technique achieves outstanding performance with low latency and consistently improves existing frameworks when combined with them.
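The described combination transcribes almost directly into code. In the sketch below, `eps` is the pretrained noise prediction network conditioned on a prompt embedding; the interpolation weight `alpha`, the scale `lam`, and the use of the source-prompt prediction as the standard denoising term are assumptions about the exact formulation.

```python
# Sketch of the corrected noise prediction; alpha/lam schedules and the exact
# "standard" term are assumptions, not the paper's precise formulation.
def corrected_eps(eps, x_t, t, c_src, c_tgt, alpha=0.5, lam=1.0):
    eps_src = eps(x_t, t, c_src)                   # source-prompt prediction
    c_mix = (1.0 - alpha) * c_src + alpha * c_tgt  # progressive prompt interpolation
    correction = eps(x_t, t, c_mix) - eps_src      # noise correction term
    # Linear combination: the standard term reconstructs must-be-preserved
    # regions; the correction edits regions relevant to the target prompt.
    return eps_src + lam * correction
```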
{"title":"Diffusion-Based Image-to-Image Translation by Noise Correction via Prompt Interpolation","authors":"Junsung Lee, Minsoo Kang, Bohyung Han","doi":"arxiv-2409.08077","DOIUrl":"https://doi.org/arxiv-2409.08077","url":null,"abstract":"We propose a simple but effective training-free approach tailored to\u0000diffusion-based image-to-image translation. Our approach revises the original\u0000noise prediction network of a pretrained diffusion model by introducing a noise\u0000correction term. We formulate the noise correction term as the difference\u0000between two noise predictions; one is computed from the denoising network with\u0000a progressive interpolation of the source and target prompt embeddings, while\u0000the other is the noise prediction with the source prompt embedding. The final\u0000noise prediction network is given by a linear combination of the standard\u0000denoising term and the noise correction term, where the former is designed to\u0000reconstruct must-be-preserved regions while the latter aims to effectively edit\u0000regions of interest relevant to the target prompt. Our approach can be easily\u0000incorporated into existing image-to-image translation methods based on\u0000diffusion models. Extensive experiments verify that the proposed technique\u0000achieves outstanding performance with low latency and consistently improves\u0000existing frameworks when combined with them.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction
Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang
The task of vision-based 3D occupancy prediction aims to reconstruct 3D geometry and estimate its semantic classes from 2D color images, where the 2D-to-3D view transformation is an indispensable step. Most previous methods conduct forward projection, such as BEVPooling and VoxelPooling, both of which map the 2D image features into 3D grids. However, a grid representing features within a certain height range usually picks up many confusing features that belong to other height ranges. To address this challenge, we present Deep Height Decoupling (DHD), a novel framework that incorporates an explicit height prior to filter out the confusing features. Specifically, DHD first predicts height maps via explicit supervision. Based on the height distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to adaptively decouple the height map into multiple binary masks. MGHS projects the 2D image features into multiple subspaces, where each grid contains features within reasonable height ranges. Finally, a Synergistic Feature Aggregation (SFA) module enhances the feature representation through channel and spatial affinities, enabling further occupancy refinement. On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art performance even with minimal input frames. Code is available at https://github.com/yanzq95/DHD.
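The decoupling step itself is easy to picture: threshold a predicted height map into one binary mask per height interval, then use each mask to route 2D features into its own subspace. The toy below uses fixed, hand-picked bins; DHD derives its bins from height distribution statistics and learns the height maps under explicit supervision.

```python
# Toy mask-guided height decoupling with hand-picked bins; DHD's bins come
# from dataset height statistics and supervised height-map prediction.
import torch

def height_masks(height_map, bins=((-2.0, 0.5), (0.5, 2.0), (2.0, 6.0))):
    # height_map: (B, H, W) metric heights -> one binary mask per height bin
    return [((height_map >= lo) & (height_map < hi)).float() for lo, hi in bins]

masks = height_masks(torch.randn(2, 200, 200) * 3.0)
print([m.mean().item() for m in masks])  # fraction of pixels falling in each bin
```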
{"title":"Deep Height Decoupling for Precise Vision-based 3D Occupancy Prediction","authors":"Yuan Wu, Zhiqiang Yan, Zhengxue Wang, Xiang Li, Le Hui, Jian Yang","doi":"arxiv-2409.07972","DOIUrl":"https://doi.org/arxiv-2409.07972","url":null,"abstract":"The task of vision-based 3D occupancy prediction aims to reconstruct 3D\u0000geometry and estimate its semantic classes from 2D color images, where the\u00002D-to-3D view transformation is an indispensable step. Most previous methods\u0000conduct forward projection, such as BEVPooling and VoxelPooling, both of which\u0000map the 2D image features into 3D grids. However, the current grid representing\u0000features within a certain height range usually introduces many confusing\u0000features that belong to other height ranges. To address this challenge, we\u0000present Deep Height Decoupling (DHD), a novel framework that incorporates\u0000explicit height prior to filter out the confusing features. Specifically, DHD\u0000first predicts height maps via explicit supervision. Based on the height\u0000distribution statistics, DHD designs Mask Guided Height Sampling (MGHS) to\u0000adaptively decoupled the height map into multiple binary masks. MGHS projects\u0000the 2D image features into multiple subspaces, where each grid contains\u0000features within reasonable height ranges. Finally, a Synergistic Feature\u0000Aggregation (SFA) module is deployed to enhance the feature representation\u0000through channel and spatial affinities, enabling further occupancy refinement.\u0000On the popular Occ3D-nuScenes benchmark, our method achieves state-of-the-art\u0000performance even with minimal input frames. Code is available at\u0000https://github.com/yanzq95/DHD.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}