
arXiv - CS - Computer Vision and Pattern Recognition: Latest Publications

Tracking Any Point with Frame-Event Fusion Network at High Frame Rate
Pub Date : 2024-09-18 DOI: arxiv-2409.11953
Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu
Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point's trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24% on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at https://github.com/ljx1002/FE-TAP.
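The abstract describes two mechanisms: fusing image-frame context with event features, and a transformer-based stage that iteratively refines point trajectories. Below is a minimal sketch of what such an iterative refiner could look like; every module name, shape, and layer choice is an assumption for illustration, not the authors' FE-TAP implementation.

```python
import torch
import torch.nn as nn

class TrajectoryRefiner(nn.Module):
    """Toy stand-in for an iterative, transformer-based trajectory refiner over
    fused frame/event point features. Shapes and layers are illustrative only."""
    def __init__(self, feat_dim=128, num_iters=4):
        super().__init__()
        self.num_iters = num_iters
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)       # crude frame/event fusion
        self.embed_xy = nn.Linear(2, feat_dim)              # inject current point positions
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_delta = nn.Linear(feat_dim, 2)              # per-point position update (dx, dy)

    def forward(self, frame_feats, event_feats, init_xy):
        # frame_feats, event_feats: (B, N, feat_dim); init_xy: (B, N, 2) in [0, 1]
        xy = init_xy
        fused = self.fuse(torch.cat([frame_feats, event_feats], dim=-1))
        for _ in range(self.num_iters):                     # iterative refinement
            tokens = fused + self.embed_xy(xy)
            xy = xy + self.to_delta(self.encoder(tokens))
        return xy

refiner = TrajectoryRefiner()
xy = refiner(torch.randn(1, 16, 128), torch.randn(1, 16, 128), torch.rand(1, 16, 2))
print(xy.shape)  # torch.Size([1, 16, 2])
```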
Citations: 0
Panoptic-Depth Forecasting
Pub Date : 2024-09-18 DOI: arxiv-2409.12008
Juana Valeria Hurtado, Riya Mohan, Abhinav Valada
Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de.
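As a rough illustration of the encoder / forecasting module / task-specific decoder layout the abstract outlines, the skeleton below aggregates past frame features with a transformer and decodes panoptic logits and depth for a future step. All shapes, layer choices, and names are assumptions rather than the PDcast architecture.

```python
import torch
import torch.nn as nn

class PanopticDepthForecaster(nn.Module):
    """Skeleton of an encoder / temporal-forecasting / dual-decoder layout.
    Layer choices and shapes are illustrative assumptions."""
    def __init__(self, feat_dim=256, num_classes=19):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=8, stride=8)   # toy per-frame encoder
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)        # aggregates the past frames
        self.panoptic_head = nn.Conv2d(feat_dim, num_classes, 1)          # semantic/panoptic logits
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)                       # per-pixel depth

    def forward(self, frames):
        # frames: (B, T, 3, H, W) past observations
        B, T, _, _, _ = frames.shape
        feats = self.backbone(frames.flatten(0, 1))                       # (B*T, D, h, w)
        D, h, w = feats.shape[1:]
        tokens = feats.view(B, T, D, h * w).permute(0, 3, 1, 2).reshape(B * h * w, T, D)
        ctx = self.temporal(tokens)[:, -1]                                # last token summarises the sequence
        ctx = ctx.view(B, h, w, D).permute(0, 3, 1, 2)
        return self.panoptic_head(ctx), self.depth_head(ctx)

model = PanopticDepthForecaster()
pan, depth = model(torch.randn(1, 4, 3, 128, 256))
print(pan.shape, depth.shape)  # (1, 19, 16, 32) (1, 1, 16, 32)
```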
Citations: 0
Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective
Pub Date : 2024-09-18 DOI: arxiv-2409.11937
Zhihui He, Chengyuan Wang, Shidong Yang, Li Chen, Yanheng Zhou, Shuo Wang
Tooth arrangement is an essential step in the digital orthodontic planning process. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception and motion regression. It could lead to poor perceptions of three-dimensional transformation. They also ignore the possible overlaps or gaps between teeth of predicted dentition, which is generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network, decoupling predicting tasks and feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the beginning and target teeth. To learn the hidden features better, DTAN also decouples the teeth-hidden features into geometric and positional features, which are further supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss function for point cloud data to constrain the related gestures between teeth, which can be easily extended to other 3D point cloud tasks. We propose an arch-width guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three different tooth arrangement datasets and achieve drastically improved performance on accuracy and speed compared with existing methods.
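The differentiable collision loss is the most self-contained idea here. Below is a generic sketch of such a penalty on point clouds (a hinge on pairwise distances that fall below a margin), shown only to illustrate how collision supervision can stay differentiable; it is not the paper's exact formulation, and `margin` is an assumed hyper-parameter.

```python
import torch

def collision_penalty(pc_a, pc_b, margin=0.5):
    """Differentiable stand-in for a point-cloud collision loss: penalise pairs of
    points from two neighbouring teeth that come closer than `margin`. Illustrative
    hinge on pairwise distances, not the authors' exact formulation."""
    # pc_a: (Na, 3), pc_b: (Nb, 3) point clouds of two adjacent teeth
    dists = torch.cdist(pc_a.unsqueeze(0), pc_b.unsqueeze(0)).squeeze(0)   # (Na, Nb)
    violation = torch.clamp(margin - dists, min=0.0)                       # > 0 only when too close
    return violation.pow(2).mean()                                          # smooth and fully differentiable

# Toy usage: the penalty back-propagates into a predicted rigid motion (here just an offset).
pc_a = torch.rand(200, 3)
offset = torch.zeros(3, requires_grad=True)
pc_b = torch.rand(200, 3) + offset
loss = collision_penalty(pc_a, pc_b)
loss.backward()
print(loss.item(), offset.grad is not None)  # gradient flows to the motion parameters
```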
Citations: 0
End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation
Pub Date : 2024-09-18 DOI: arxiv-2409.11819
Thomas Pöllabauer, Jiayin Li, Volker Knauthe, Sarah Berkei, Arjan Kuijper
6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system, which is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Due to the ill-posed nature of the pose estimation problem, where multiple different poses can correspond to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density distribution of the pose. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic solution shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while providing the additional benefit of being able to sample multiple meaningful pose candidates.
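To make the "predict a pose distribution, then sample candidates" idea concrete, the hypothetical head below outputs a mean and log-variance over a 6-D pose parameterisation (3 translation + 3 axis-angle) and draws several candidates by reparameterised sampling. The parameterisation and layer sizes are assumptions, not EPRO-GDR's actual design.

```python
import torch
import torch.nn as nn

class ProbabilisticPoseHead(nn.Module):
    """Illustration of predicting a pose *distribution* rather than a point
    estimate; an independent Gaussian over 6 pose parameters is assumed here."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mean = nn.Linear(feat_dim, 6)
        self.log_var = nn.Linear(feat_dim, 6)

    def forward(self, feat, num_samples=8):
        mu, log_var = self.mean(feat), self.log_var(feat)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn(num_samples, *mu.shape)          # reparameterised sampling
        return mu, mu.unsqueeze(0) + eps * std.unsqueeze(0)

head = ProbabilisticPoseHead()
mu, candidates = head(torch.randn(1, 512))
print(mu.shape, candidates.shape)  # torch.Size([1, 6]) torch.Size([8, 1, 6])
```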
Citations: 0
JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
Pub Date : 2024-09-18 DOI: arxiv-2409.12156
Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
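The contrastive alignment between audio features and lip motion can be illustrated with a standard InfoNCE objective: embeddings from the same time window form positive pairs, all others act as negatives. The paper's actual objective may differ; this is a generic sketch with an assumed temperature value.

```python
import torch
import torch.nn.functional as F

def audio_lip_contrastive_loss(audio_emb, lip_emb, temperature=0.07):
    """Generic InfoNCE-style loss: pull each audio embedding toward the lip-motion
    embedding of the same window, push it away from the others."""
    audio_emb = F.normalize(audio_emb, dim=-1)   # (B, D)
    lip_emb = F.normalize(lip_emb, dim=-1)       # (B, D)
    logits = audio_emb @ lip_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy: audio -> lip and lip -> audio.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = audio_lip_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```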
Citations: 0
Agglomerative Token Clustering
Pub Date : 2024-09-18 DOI: arxiv-2409.11923
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
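Since ATC is described as parameter-free bottom-up clustering of tokens, a small sketch with SciPy's agglomerative clustering conveys the mechanism: cluster token embeddings hierarchically, cut the tree according to the desired keep rate, and average each cluster into a single token. The linkage method and distance metric are assumptions, not necessarily the paper's choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_tokens(tokens, keep_rate=0.25):
    """Reduce a token sequence by bottom-up (agglomerative) clustering and
    averaging the tokens inside each cluster. Parameter-free sketch of the idea."""
    n_tokens, _ = tokens.shape
    n_keep = max(1, int(round(keep_rate * n_tokens)))         # number of clusters to keep
    Z = linkage(tokens, method="average", metric="cosine")    # bottom-up hierarchy
    labels = fcluster(Z, t=n_keep, criterion="maxclust")      # cut into (at most) n_keep clusters
    merged = np.stack([tokens[labels == c].mean(axis=0) for c in np.unique(labels)])
    return merged

tokens = np.random.randn(196, 384).astype(np.float32)         # e.g. ViT patch tokens
print(merge_tokens(tokens, keep_rate=0.25).shape)             # roughly (49, 384)
```

In a real vision transformer this reduction would typically be applied to the patch tokens between blocks, so later layers process far fewer tokens.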
Citations: 0
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
Pub Date : 2024-09-18 DOI: arxiv-2409.11919
Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord
Vision Language Models (VLMs) have shown impressive performances on numerous tasks but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires 'white-box' access to the model's architecture and weights as well as expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a 'black-box' manner by leveraging large language models (LLMs) so as to reason on their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, resulting in competitive results when compared with classic fine-tuning.
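The black-box recipe (letting an LLM reason over a VLM's raw outputs) can be pictured as prompt construction: the VLM's candidate boxes are serialised as text and a chat LLM is asked to pick the one matching the referring expression. The prompt wording and detection format below are purely illustrative assumptions, not the paper's actual prompt.

```python
def build_rec_prompt(expression, detections):
    """Format a VLM's candidate detections as text so that a black-box LLM can
    pick the box matching a referring expression. Hypothetical format."""
    lines = [
        "You are given candidate boxes detected by a vision-language model.",
        f'Referring expression: "{expression}"',
        "Candidates (id, label, confidence, box [x1, y1, x2, y2]):",
    ]
    for i, det in enumerate(detections):
        lines.append(f"  {i}: {det['label']} ({det['score']:.2f}) {det['box']}")
    lines.append("Answer with the id of the single best-matching candidate.")
    return "\n".join(lines)

detections = [
    {"label": "dog", "score": 0.91, "box": [34, 80, 210, 300]},
    {"label": "dog", "score": 0.88, "box": [250, 95, 420, 310]},
]
print(build_rec_prompt("the dog on the right", detections))
# The returned string is sent to any chat LLM; its reply indexes the chosen box.
```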
Citations: 0
InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models
Pub Date : 2024-09-18 DOI: arxiv-2409.11734
Yan Zheng, Lemeng Wu
In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for GEO, an exceptionally versatile image editing technique designed to cater to customized user requirements at both local and global scales. Our approach seamlessly integrates text prompts and image prompts to yield diverse and precise editing outcomes. Notably, our method operates without the need for training and is driven by two key contributions: (i) a novel geometric accumulation loss that enhances DDIM inversion to faithfully preserve pixel space geometry and layout, and (ii) an innovative boosted image prompt technique that combines pixel-level editing for text-only inversion with latent space geometry guidance for standard classifier-free reversion. Leveraging the publicly available Stable Diffusion model, our approach undergoes extensive evaluation across various image types and challenging prompt editing scenarios, consistently delivering high-fidelity editing results for real images.
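The baseline the first contribution builds on is DDIM inversion, which maps a real image's latent back to noise by running the deterministic DDIM update in reverse. The sketch below shows that plain inversion loop only; GEO's geometric accumulation loss is not reproduced, and `eps_model(x, t)` is an assumed noise-predictor interface.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    """Plain deterministic DDIM inversion: re-noise a clean latent x0 step by step
    using the noise predicted by eps_model. Baseline only, no extra guidance."""
    T = len(alphas_cumprod)
    steps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = steps[i], steps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # re-noise toward step t_next
    return x  # approximate x_T that reconstructs x0 under DDIM sampling

# Toy usage with a dummy noise predictor and a standard linear beta schedule.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
latent = torch.randn(1, 4, 64, 64)
inverted = ddim_invert(latent, lambda x, t: torch.randn_like(x), alphas_cumprod)
print(inverted.shape)  # torch.Size([1, 4, 64, 64])
```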
Citations: 0
Distillation-free Scaling of Large SSMs for Images and Videos
Pub Date : 2024-09-18 DOI: arxiv-2409.11867
Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7.
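The architectural claim concerns the layout, alternating SSM-style sequence-mixing blocks with attention blocks, rather than any single block. The sketch below builds such an interleaved stack; the `SequenceMixerBlock` is a simplified gated-convolution stand-in for a Mamba/S6 block so the example stays self-contained, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class SequenceMixerBlock(nn.Module):
    """Placeholder for a Mamba/S6 block: a gated causal depth-wise convolution
    stands in for the selective scan purely to keep the sketch runnable."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = self.norm(x)
        h = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)  # causal conv
        return x + h * torch.sigmoid(self.gate(self.norm(x)))

def build_interleaved_backbone(dim=256, depth=8):
    """Alternate sequence-mixing (SSM-style) and attention blocks."""
    blocks = [SequenceMixerBlock(dim) if i % 2 == 0 else AttentionBlock(dim)
              for i in range(depth)]
    return nn.Sequential(*blocks)

backbone = build_interleaved_backbone()
tokens = torch.randn(2, 196, 256)        # e.g. flattened image patch tokens
print(backbone(tokens).shape)            # torch.Size([2, 196, 256])
```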
Citations: 0
A Chinese Continuous Sign Language Dataset Based on Complex Environments
Pub Date : 2024-09-18 DOI: arxiv-2409.11960
Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan
The current bottleneck in continuous sign language recognition (CSLR) research lies in the fact that most publicly available datasets are limited to laboratory environments or television program recordings, resulting in a single background environment with uniform lighting, which significantly deviates from the diversity and complexity found in real-life scenarios. To address this challenge, we have constructed a new, large-scale dataset for Chinese continuous sign language (CSL) based on complex environments, termed the complex environment - chinese sign language dataset (CE-CSL). This dataset encompasses 5,988 continuous CSL video clips collected from daily life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) model for continuous sign language recognition. This model extracts frame-level features and then utilizes both temporal and spectral information to separately derive sequence features before fusion, aiming to achieve efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on the CE-CSL, validating its effectiveness under complex background conditions. Additionally, our proposed method has also yielded highly competitive results when applied to three publicly available CSL datasets.
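The "separate temporal and spectral branches over frame-level features, fused before recognition" idea can be sketched as one branch of temporal convolutions and one branch operating on the FFT magnitude of the feature sequence. Branch designs, dimensions, and the fusion layer below are assumptions, not TFNet's actual layers.

```python
import torch
import torch.nn as nn

class TimeFrequencyFusion(nn.Module):
    """Sketch of a time/frequency dual-branch over per-frame features: a temporal
    convolution branch and a spectral branch on the FFT magnitude, then fusion."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.temporal = nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2)
        self.spectral = nn.Linear(feat_dim, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, frame_feats):
        # frame_feats: (B, T, feat_dim) per-frame features from a 2D CNN backbone
        t_branch = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        spec = torch.fft.rfft(frame_feats, dim=1).abs()                        # (B, T//2+1, feat_dim)
        s_branch = self.spectral(spec).mean(dim=1, keepdim=True)               # global spectral summary
        s_branch = s_branch.expand(-1, t_branch.size(1), -1)
        return self.fuse(torch.cat([t_branch, s_branch], dim=-1))              # fused sequence features

model = TimeFrequencyFusion()
out = model(torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 64, 256]); a CTC head for recognition would follow
```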
Citations: 0