Tracking Any Point with Frame-Event Fusion Network at High Frame Rate
Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu
arXiv:2409.11953 (18 Sep 2024)
Tracking any point from image frames alone is constrained by the frame rate, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose FE-TAP, an image-event fusion point tracker that combines the contextual information of image frames with the high temporal resolution of events, achieving high-frame-rate and robust point tracking under various challenging conditions. Specifically, we design an Evolution Fusion module (EvoFusion) to model the image generation process guided by events; this module effectively integrates valuable information from the two modalities, which operate at different frequencies. To obtain smoother point trajectories, we employ a transformer-based refinement strategy that iteratively updates point trajectories and their features. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, improving the expected feature age by 24% on the EDS dataset. Finally, we qualitatively validate the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at https://github.com/ljx1002/FE-TAP.
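A minimal conceptual sketch of the two ideas the abstract names, fusing frame features with an event representation and iteratively refining point locations with a transformer, is given below. The module name, the naive concatenation fusion, the 5-bin event voxel grid, and the update step size are illustrative assumptions, not the paper's EvoFusion or refinement design.

```python
# Hypothetical sketch (not the authors' code): fuse frame features with an event
# voxel grid, then iteratively refine tracked point locations with a transformer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAndRefineSketch(nn.Module):
    def __init__(self, dim=64, iters=4):
        super().__init__()
        self.frame_enc = nn.Conv2d(3, dim, 3, padding=1)   # image branch
        self.event_enc = nn.Conv2d(5, dim, 3, padding=1)   # 5-bin event voxel branch (assumption)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)             # naive fusion stand-in for EvoFusion
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.refiner = nn.TransformerEncoder(layer, num_layers=2)
        self.delta = nn.Linear(dim, 2)                     # per-point (dx, dy) update
        self.iters = iters

    def forward(self, frame, event_voxel, points):
        # frame: (B,3,H,W), event_voxel: (B,5,H,W), points: (B,N,2) in [-1,1] grid coords
        feat = self.fuse(torch.cat([self.frame_enc(frame), self.event_enc(event_voxel)], dim=1))
        for _ in range(self.iters):
            sampled = F.grid_sample(feat, points.unsqueeze(2), align_corners=True)  # (B,C,N,1)
            tokens = sampled.squeeze(-1).transpose(1, 2)                             # (B,N,C)
            points = points + 0.05 * torch.tanh(self.delta(self.refiner(tokens)))    # small refinement step
        return points

B, H, W, N = 1, 64, 64, 8
model = FusionAndRefineSketch()
out = model(torch.randn(B, 3, H, W), torch.randn(B, 5, H, W), torch.rand(B, N, 2) * 2 - 1)
print(out.shape)  # torch.Size([1, 8, 2])
```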
{"title":"Tracking Any Point with Frame-Event Fusion Network at High Frame Rate","authors":"Jiaxiong Liu, Bo Wang, Zhen Tan, Jinpu Zhang, Hui Shen, Dewen Hu","doi":"arxiv-2409.11953","DOIUrl":"https://doi.org/arxiv-2409.11953","url":null,"abstract":"Tracking any point based on image frames is constrained by frame rates,\u0000leading to instability in high-speed scenarios and limited generalization in\u0000real-world applications. To overcome these limitations, we propose an\u0000image-event fusion point tracker, FE-TAP, which combines the contextual\u0000information from image frames with the high temporal resolution of events,\u0000achieving high frame rate and robust point tracking under various challenging\u0000conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to\u0000model the image generation process guided by events. This module can\u0000effectively integrate valuable information from both modalities operating at\u0000different frequencies. To achieve smoother point trajectories, we employed a\u0000transformer-based refinement strategy that updates the point's trajectories and\u0000features iteratively. Extensive experiments demonstrate that our method\u0000outperforms state-of-the-art approaches, particularly improving expected\u0000feature age by 24$%$ on EDS datasets. Finally, we qualitatively validated the\u0000robustness of our algorithm in real driving scenarios using our custom-designed\u0000high-resolution image-event synchronization device. Our source code will be\u0000released at https://github.com/ljx1002/FE-TAP.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Panoptic-Depth Forecasting
Juana Valeria Hurtado, Riya Mohan, Abhinav Valada
arXiv:2409.12008 (18 Sep 2024)
Forecasting the semantics and 3D structure of scenes is essential for robots to navigate and plan actions safely. Recent methods have explored semantic and panoptic scene forecasting; however, they do not consider the geometry of the scene. In this work, we propose the panoptic-depth forecasting task for jointly predicting the panoptic segmentation and depth maps of unobserved future frames, from monocular camera images. To facilitate this work, we extend the popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR point clouds and leveraging sequential labeled data. We also introduce a suitable evaluation metric that quantifies both the panoptic quality and depth estimation accuracy of forecasts in a coherent manner. Furthermore, we present two baselines and propose the novel PDcast architecture that learns rich spatio-temporal representations by incorporating a transformer-based encoder, a forecasting module, and task-specific decoders to predict future panoptic-depth outputs. Extensive evaluations demonstrate the effectiveness of PDcast across two datasets and three forecasting tasks, consistently addressing the primary challenges. We make the code publicly available at https://pdcast.cs.uni-freiburg.de.
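The paper's metric is not reproduced here; as a hedged toy illustration of scoring panoptic quality and depth accuracy jointly, the sketch below down-weights each matched segment's IoU by a per-segment depth inlier ratio (the standard delta < 1.25 threshold). All choices are assumptions for illustration only.

```python
# Toy, hedged illustration only (not the paper's metric): couple a segment's IoU
# with a depth inlier ratio so that a matched segment counts less if its depth is off.
import numpy as np

def depth_aware_segment_score(pred_mask, gt_mask, pred_depth, gt_depth):
    inter = np.logical_and(pred_mask, gt_mask)
    union = np.logical_or(pred_mask, gt_mask).sum()
    if union == 0 or inter.sum() == 0:
        return 0.0
    iou = inter.sum() / union
    ratio = np.maximum(pred_depth[inter] / gt_depth[inter],
                       gt_depth[inter] / pred_depth[inter])
    inlier = (ratio < 1.25).mean()   # standard depth threshold accuracy (delta_1)
    return iou * inlier

pred_mask = np.zeros((4, 4), bool); pred_mask[:2] = True
gt_mask = np.zeros((4, 4), bool); gt_mask[:3] = True
score = depth_aware_segment_score(pred_mask, gt_mask,
                                  np.full((4, 4), 10.0), np.full((4, 4), 11.0))
print(round(score, 3))  # 0.667: full depth inlier ratio, partial mask overlap
```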
{"title":"Panoptic-Depth Forecasting","authors":"Juana Valeria Hurtado, Riya Mohan, Abhinav Valada","doi":"arxiv-2409.12008","DOIUrl":"https://doi.org/arxiv-2409.12008","url":null,"abstract":"Forecasting the semantics and 3D structure of scenes is essential for robots\u0000to navigate and plan actions safely. Recent methods have explored semantic and\u0000panoptic scene forecasting; however, they do not consider the geometry of the\u0000scene. In this work, we propose the panoptic-depth forecasting task for jointly\u0000predicting the panoptic segmentation and depth maps of unobserved future\u0000frames, from monocular camera images. To facilitate this work, we extend the\u0000popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR\u0000point clouds and leveraging sequential labeled data. We also introduce a\u0000suitable evaluation metric that quantifies both the panoptic quality and depth\u0000estimation accuracy of forecasts in a coherent manner. Furthermore, we present\u0000two baselines and propose the novel PDcast architecture that learns rich\u0000spatio-temporal representations by incorporating a transformer-based encoder, a\u0000forecasting module, and task-specific decoders to predict future panoptic-depth\u0000outputs. Extensive evaluations demonstrate the effectiveness of PDcast across\u0000two datasets and three forecasting tasks, consistently addressing the primary\u0000challenges. We make the code publicly available at\u0000https://pdcast.cs.uni-freiburg.de.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"188 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective
Zhihui He, Chengyuan Wang, Shidong Yang, Li Chen, Yanheng Zhou, Shuo Wang
arXiv:2409.11937 (18 Sep 2024)
Tooth arrangement is an essential step in digital orthodontic planning. Existing learning-based methods use hidden teeth features to directly regress teeth motions, which couples target pose perception with motion regression and can lead to a poor perception of the underlying three-dimensional transformation. These methods also ignore possible overlaps or gaps between teeth in the predicted dentition, which are generally unacceptable. Therefore, we propose DTAN, a differentiable collision-supervised tooth arrangement network that decouples the prediction task from feature modeling. DTAN decouples the tooth arrangement task by first predicting the hidden features of the final teeth poses and then using them to assist in regressing the motions between the initial and target teeth. To learn the hidden features better, DTAN further decouples the teeth-hidden features into geometric and positional features, which are supervised by feature consistency constraints. Furthermore, we propose a novel differentiable collision loss for point cloud data that constrains the relative poses between teeth and can be easily extended to other 3D point cloud tasks. We also propose an arch-width-guided tooth arrangement network, named C-DTAN, to make the results controllable. We construct three tooth arrangement datasets and achieve substantially improved accuracy and speed compared with existing methods.
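As a hedged sketch of what a differentiable collision penalty for point clouds can look like (not DTAN's actual loss), the snippet below penalizes point pairs from two neighboring teeth that come closer than a margin, so gradients push overlapping geometry apart; the margin and the pairwise formulation are illustrative assumptions.

```python
# Hedged sketch, not DTAN's loss: a soft, differentiable collision-style penalty
# between two point clouds based on pairwise distances below a margin.
import torch

def soft_collision_loss(pc_a, pc_b, margin=0.5):
    # pc_a: (N,3), pc_b: (M,3) point clouds of two neighboring teeth
    dists = torch.cdist(pc_a, pc_b)             # (N, M) pairwise Euclidean distances
    penetration = torch.relu(margin - dists)    # nonzero only where points are too close
    return penetration.pow(2).mean()            # smooth and fully differentiable

a = torch.randn(128, 3, requires_grad=True)
b = torch.randn(128, 3) + 0.2                   # partially overlapping cloud
loss = soft_collision_loss(a, b)
loss.backward()
print(loss.item(), a.grad.shape)                # gradient available for pose updates
```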
{"title":"Differentiable Collision-Supervised Tooth Arrangement Network with a Decoupling Perspective","authors":"Zhihui He, Chengyuan Wang, Shidong Yang, Li Chen, Yanheng Zhou, Shuo Wang","doi":"arxiv-2409.11937","DOIUrl":"https://doi.org/arxiv-2409.11937","url":null,"abstract":"Tooth arrangement is an essential step in the digital orthodontic planning\u0000process. Existing learning-based methods use hidden teeth features to directly\u0000regress teeth motions, which couples target pose perception and motion\u0000regression. It could lead to poor perceptions of three-dimensional\u0000transformation. They also ignore the possible overlaps or gaps between teeth of\u0000predicted dentition, which is generally unacceptable. Therefore, we propose\u0000DTAN, a differentiable collision-supervised tooth arrangement network,\u0000decoupling predicting tasks and feature modeling. DTAN decouples the tooth\u0000arrangement task by first predicting the hidden features of the final teeth\u0000poses and then using them to assist in regressing the motions between the\u0000beginning and target teeth. To learn the hidden features better, DTAN also\u0000decouples the teeth-hidden features into geometric and positional features,\u0000which are further supervised by feature consistency constraints. Furthermore,\u0000we propose a novel differentiable collision loss function for point cloud data\u0000to constrain the related gestures between teeth, which can be easily extended\u0000to other 3D point cloud tasks. We propose an arch-width guided tooth\u0000arrangement network, named C-DTAN, to make the results controllable. We\u0000construct three different tooth arrangement datasets and achieve drastically\u0000improved performance on accuracy and speed compared with existing methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250568","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation
Thomas Pöllabauer, Jiayin Li, Volker Knauthe, Sarah Berkei, Arjan Kuijper
arXiv:2409.11819 (18 Sep 2024)
6D object pose estimation is the problem of identifying the position and orientation of an object relative to a chosen coordinate system and is a core technology for modern XR applications. State-of-the-art 6D object pose estimators directly predict an object pose given an object observation. Because the pose estimation problem is ill-posed, with multiple different poses potentially corresponding to a single observation, generating additional plausible estimates per observation can be valuable. To address this, we reformulate the state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End Probabilistic Geometry-Guided Regression). Instead of predicting a single pose per detection, we estimate a probability density over poses. Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose Estimation) Challenge, we test our approach on four of its core datasets and demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and ITODD. Our probabilistic formulation shows that predicting a pose distribution instead of a single pose can improve state-of-the-art single-view pose estimation while offering the additional benefit of sampling multiple meaningful pose candidates.
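A hedged sketch of the practical benefit of a pose distribution, sampling several plausible hypotheses per detection, is shown below. The Gaussian translation noise and small axis-angle rotation perturbations are stand-ins for illustration; EPRO-GDR predicts its own probability density.

```python
# Hedged sketch (not EPRO-GDR itself): mock a pose distribution as Gaussian
# translation noise plus small axis-angle rotation perturbations, then sample
# multiple pose candidates from it.
import numpy as np
from scipy.spatial.transform import Rotation as R

def sample_pose_candidates(r_mean, t_mean, rot_sigma_rad, t_sigma, k=5, seed=0):
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(k):
        dr = R.from_rotvec(rng.normal(0.0, rot_sigma_rad, size=3))  # small rotation noise
        t = t_mean + rng.normal(0.0, t_sigma, size=3)               # translation noise
        candidates.append((dr * r_mean, t))
    return candidates

r_mean = R.from_euler("xyz", [10, 0, 45], degrees=True)
for rot, t in sample_pose_candidates(r_mean, np.array([0.1, 0.0, 0.8]), 0.02, 0.01):
    print(rot.as_euler("xyz", degrees=True).round(2), t.round(3))
```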
{"title":"End-to-End Probabilistic Geometry-Guided Regression for 6DoF Object Pose Estimation","authors":"Thomas Pöllabauer, Jiayin Li, Volker Knauthe, Sarah Berkei, Arjan Kuijper","doi":"arxiv-2409.11819","DOIUrl":"https://doi.org/arxiv-2409.11819","url":null,"abstract":"6D object pose estimation is the problem of identifying the position and\u0000orientation of an object relative to a chosen coordinate system, which is a\u0000core technology for modern XR applications. State-of-the-art 6D object pose\u0000estimators directly predict an object pose given an object observation. Due to\u0000the ill-posed nature of the pose estimation problem, where multiple different\u0000poses can correspond to a single observation, generating additional plausible\u0000estimates per observation can be valuable. To address this, we reformulate the\u0000state-of-the-art algorithm GDRNPP and introduce EPRO-GDR (End-to-End\u0000Probabilistic Geometry-Guided Regression). Instead of predicting a single pose\u0000per detection, we estimate a probability density distribution of the pose.\u0000Using the evaluation procedure defined by the BOP (Benchmark for 6D Object Pose\u0000Estimation) Challenge, we test our approach on four of its core datasets and\u0000demonstrate superior quantitative results for EPRO-GDR on LM-O, YCB-V, and\u0000ITODD. Our probabilistic solution shows that predicting a pose distribution\u0000instead of a single pose can improve state-of-the-art single-view pose\u0000estimation while providing the additional benefit of being able to sample\u0000multiple meaningful pose candidates.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
arXiv:2409.12156 (18 Sep 2024)
We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
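A hedged sketch of the kind of contrastive objective described, pulling temporally aligned audio and lip-motion features together while pushing mismatched pairs apart with an InfoNCE loss, is given below; the feature dimensions and temperature are placeholder assumptions, not the paper's components.

```python
# Hedged sketch: symmetric InfoNCE between audio and lip-motion embeddings,
# where row i of each batch comes from the same time window.
import torch
import torch.nn.functional as F

def info_nce(audio_feats, lip_feats, temperature=0.07):
    a = F.normalize(audio_feats, dim=-1)          # (B, D)
    l = F.normalize(lip_feats, dim=-1)            # (B, D)
    logits = a @ l.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```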
{"title":"JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation","authors":"Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras","doi":"arxiv-2409.12156","DOIUrl":"https://doi.org/arxiv-2409.12156","url":null,"abstract":"We introduce a novel method for joint expression and audio-guided talking\u0000face generation. Recent approaches either struggle to preserve the speaker\u0000identity or fail to produce faithful facial expressions. To address these\u0000challenges, we propose a NeRF-based network. Since we train our network on\u0000monocular videos without any ground truth, it is essential to learn\u0000disentangled representations for audio and expression. We first learn audio\u0000features in a self-supervised manner, given utterances from multiple subjects.\u0000By incorporating a contrastive learning technique, we ensure that the learned\u0000audio features are aligned to the lip motion and disentangled from the muscle\u0000motion of the rest of the face. We then devise a transformer-based architecture\u0000that learns expression features, capturing long-range facial expressions and\u0000disentangling them from the speech-specific mouth movements. Through\u0000quantitative and qualitative evaluation, we demonstrate that our method can\u0000synthesize high-fidelity talking face videos, achieving state-of-the-art facial\u0000expression transfer along with lip synchronization to unseen audio.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Agglomerative Token Clustering
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
arXiv:2409.11923 (18 Sep 2024)
We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
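A hedged sketch of the core operation (not the official ATC code): tokens are grouped by bottom-up hierarchical clustering and each cluster is averaged, which introduces no learnable parameters. Average linkage on cosine distance and the keep rate are assumptions for illustration.

```python
# Hedged sketch: merge ViT tokens via bottom-up hierarchical clustering, then
# replace each cluster with its mean. No learnable parameters involved.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerative_token_merge(tokens, keep_rate=0.25):
    # tokens: (N, D) token embeddings; keep_rate controls how many clusters survive
    n_keep = max(1, int(round(tokens.shape[0] * keep_rate)))
    Z = linkage(tokens, method="average", metric="cosine")   # bottom-up clustering
    labels = fcluster(Z, t=n_keep, criterion="maxclust")     # cut tree into n_keep clusters
    merged = np.stack([tokens[labels == c].mean(axis=0) for c in np.unique(labels)])
    return merged

tokens = np.random.randn(196, 384)        # e.g. 14x14 patch tokens from a ViT
print(agglomerative_token_merge(tokens, keep_rate=0.25).shape)  # typically (49, 384)
```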
{"title":"Agglomerative Token Clustering","authors":"Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund","doi":"arxiv-2409.11923","DOIUrl":"https://doi.org/arxiv-2409.11923","url":null,"abstract":"We present Agglomerative Token Clustering (ATC), a novel token merging method\u0000that consistently outperforms previous token merging and pruning methods across\u0000image classification, image synthesis, and object detection & segmentation\u0000tasks. ATC merges clusters through bottom-up hierarchical clustering, without\u0000the introduction of extra learnable parameters. We find that ATC achieves\u0000state-of-the-art performance across all tasks, and can even perform on par with\u0000prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning.\u0000ATC is particularly effective when applied with low keep rates, where only a\u0000small fraction of tokens are kept and retaining task performance is especially\u0000difficult.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord
arXiv:2409.11919 (18 Sep 2024)
Vision Language Models (VLMs) have shown impressive performance on numerous tasks, but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet fine-tuning VLMs comes with limitations: it requires 'white-box' access to the model's architecture and weights, as well as the expertise to design fine-tuning objectives and tune hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach that adapts VLMs in a 'black-box' manner by leveraging large language models (LLMs) to reason about their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results competitive with classic fine-tuning.
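A hedged sketch of the black-box idea: the VLM's detections are summarized as text and an LLM is asked to reason over them to select the box matching the referring expression. The prompt format is an illustrative assumption and call_llm is a placeholder for any chat-completion API, not the paper's actual wrapper.

```python
# Hedged sketch: let an LLM reason over a textual summary of VLM detections
# to answer a referring-expression query. `call_llm` is a hypothetical hook.
import json

def build_rec_prompt(expression, candidates):
    lines = [f"Box {i}: label={c['label']}, score={c['score']:.2f}, xyxy={c['box']}"
             for i, c in enumerate(candidates)]
    return (
        "You are given object detections from a vision-language model.\n"
        + "\n".join(lines)
        + f"\n\nWhich single box best matches: \"{expression}\"?"
        + ' Answer with JSON like {"box_id": <int>}.'
    )

def pick_box(expression, candidates, call_llm):
    answer = call_llm(build_rec_prompt(expression, candidates))
    return candidates[json.loads(answer)["box_id"]]

candidates = [
    {"label": "dog", "score": 0.91, "box": [10, 40, 120, 200]},
    {"label": "dog", "score": 0.88, "box": [300, 60, 420, 210]},
]
# Fake LLM response for demonstration; a real deployment would call an actual LLM here.
print(pick_box("the dog on the right", candidates, lambda p: '{"box_id": 1}'))
```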
{"title":"LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models","authors":"Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord","doi":"arxiv-2409.11919","DOIUrl":"https://doi.org/arxiv-2409.11919","url":null,"abstract":"Vision Language Models (VLMs) have shown impressive performances on numerous\u0000tasks but their zero-shot capabilities can be limited compared to dedicated or\u0000fine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires\u0000`white-box' access to the model's architecture and weights as well as expertise\u0000to design the fine-tuning objectives and optimize the hyper-parameters, which\u0000are specific to each VLM and downstream task. In this work, we propose\u0000LLM-wrapper, a novel approach to adapt VLMs in a `black-box' manner by\u0000leveraging large language models (LLMs) so as to reason on their outputs. We\u0000demonstrate the effectiveness of LLM-wrapper on Referring Expression\u0000Comprehension (REC), a challenging open-vocabulary task that requires spatial\u0000and semantic reasoning. Our approach significantly boosts the performance of\u0000off-the-shelf models, resulting in competitive results when compared with\u0000classic fine-tuning.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models
Yan Zheng, Lemeng Wu
arXiv:2409.11734 (18 Sep 2024)
In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert (GEO for short), an exceptionally versatile image editing technique designed to cater to customized user requirements at both local and global scales. Our approach seamlessly integrates text prompts and image prompts to yield diverse and precise editing outcomes. Notably, our method operates without the need for training and is driven by two key contributions: (i) a novel geometric accumulation loss that enhances DDIM inversion to faithfully preserve pixel-space geometry and layout, and (ii) an innovative boosted image prompt technique that combines pixel-level editing for text-only inversion with latent-space geometry guidance for standard classifier-free reversion. Leveraging the publicly available Stable Diffusion model, our approach undergoes extensive evaluation across various image types and challenging prompt editing scenarios, consistently delivering high-fidelity editing results for real images.
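For context, a hedged sketch of plain DDIM inversion, the procedure the geometric accumulation loss is said to enhance, is given below; eps_model and alphas_cumprod stand in for a diffusion epsilon-predictor and its noise schedule, and the geometry-preserving loss itself is not reproduced.

```python
# Hedged sketch of standard DDIM inversion only (not the paper's GEO method).
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    # Map a clean latent x0 to its DDIM noise latent by running the update in reverse.
    timesteps = torch.linspace(0, len(alphas_cumprod) - 1, num_steps).long()
    x = x0
    for t_prev, t_next in zip(timesteps[:-1], timesteps[1:]):
        a_prev, a_next = alphas_cumprod[t_prev], alphas_cumprod[t_next]
        eps = eps_model(x, t_prev)
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

# Toy schedule and a dummy predictor just to show the call pattern.
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
dummy_eps = lambda x, t: torch.zeros_like(x)
inverted = ddim_invert(torch.randn(1, 4, 64, 64), dummy_eps, alphas_cumprod)
print(inverted.shape)
```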
{"title":"InverseMeetInsert: Robust Real Image Editing via Geometric Accumulation Inversion in Guided Diffusion Models","authors":"Yan Zheng, Lemeng Wu","doi":"arxiv-2409.11734","DOIUrl":"https://doi.org/arxiv-2409.11734","url":null,"abstract":"In this paper, we introduce Geometry-Inverse-Meet-Pixel-Insert, short for\u0000GEO, an exceptionally versatile image editing technique designed to cater to\u0000customized user requirements at both local and global scales. Our approach\u0000seamlessly integrates text prompts and image prompts to yield diverse and\u0000precise editing outcomes. Notably, our method operates without the need for\u0000training and is driven by two key contributions: (i) a novel geometric\u0000accumulation loss that enhances DDIM inversion to faithfully preserve pixel\u0000space geometry and layout, and (ii) an innovative boosted image prompt\u0000technique that combines pixel-level editing for text-only inversion with latent\u0000space geometry guidance for standard classifier-free reversion. Leveraging the\u0000publicly available Stable Diffusion model, our approach undergoes extensive\u0000evaluation across various image types and challenging prompt editing scenarios,\u0000consistently delivering high-fidelity editing results for real images.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Distillation-free Scaling of Large SSMs for Images and Videos
Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
arXiv:2409.11867 (18 Sep 2024)
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7.
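A hedged sketch of the interleaving pattern only is shown below: SSM-style blocks alternate with attention blocks in a single stack. SSMBlockStub is a gated depthwise convolution standing in for a real Mamba/S6 layer, and the depth and widths are arbitrary.

```python
# Hedged sketch of block interleaving only; SSMBlockStub is NOT a real Mamba/S6 layer.
import torch
import torch.nn as nn

class SSMBlockStub(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)  # causal-ish depthwise conv
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, L, D)
        h = self.norm(x)
        h = self.conv(h.transpose(1, 2)).transpose(1, 2)[:, : x.size(1)]
        return x + h * torch.sigmoid(self.gate(x))         # gated residual update

class AttnBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

class InterleavedStack(nn.Module):
    def __init__(self, dim=192, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SSMBlockStub(dim) if i % 2 == 0 else AttnBlock(dim) for i in range(depth)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

print(InterleavedStack()(torch.randn(2, 196, 192)).shape)  # torch.Size([2, 196, 192])
```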
{"title":"Distillation-free Scaling of Large SSMs for Images and Videos","authors":"Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall","doi":"arxiv-2409.11867","DOIUrl":"https://doi.org/arxiv-2409.11867","url":null,"abstract":"State-space models (SSMs), exemplified by S4, have introduced a novel context\u0000modeling method by integrating state-space techniques into deep learning.\u0000However, they struggle with global context modeling due to their\u0000data-independent matrices. The Mamba model addressed this with data-dependent\u0000variants via the S6 selective-scan algorithm, enhancing context modeling,\u0000especially for long sequences. However, Mamba-based architectures are difficult\u0000to scale with respect to the number of parameters, which is a major limitation\u0000for vision applications. This paper addresses the scalability issue of large\u0000SSMs for image classification and action recognition without requiring\u0000additional techniques like knowledge distillation. We analyze the distinct\u0000characteristics of Mamba-based and Attention-based models, proposing a\u0000Mamba-Attention interleaved architecture that enhances scalability, robustness,\u0000and performance. We demonstrate that the stable and efficient interleaved\u0000architecture resolves the scalability issue of Mamba-based architectures for\u0000images and videos and increases robustness to common artifacts like JPEG\u0000compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and\u0000Something-Something-v2 benchmarks demonstrates that our approach improves the\u0000accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A Chinese Continuous Sign Language Dataset Based on Complex Environments
Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan
arXiv:2409.11960 (18 Sep 2024)
The current bottleneck in continuous sign language recognition (CSLR) research is that most publicly available datasets are limited to laboratory environments or television program recordings. This results in a single background environment with uniform lighting, which deviates significantly from the diversity and complexity of real-life scenarios. To address this challenge, we have constructed a new large-scale dataset for Chinese continuous sign language (CSL) captured in complex environments, termed the Complex Environment Chinese Sign Language dataset (CE-CSL). This dataset comprises 5,988 continuous CSL video clips collected from daily-life scenes, featuring more than 70 different complex backgrounds to ensure representativeness and generalization capability. To tackle the impact of complex backgrounds on CSLR performance, we propose a time-frequency network (TFNet) for continuous sign language recognition. The model extracts frame-level features and then uses temporal and spectral information to derive sequence features separately before fusion, aiming for efficient and accurate CSLR. Experimental results demonstrate that our approach achieves significant performance improvements on CE-CSL, validating its effectiveness under complex background conditions. Our method also yields highly competitive results on three publicly available CSL datasets.
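A hedged sketch of the time-frequency idea (not the paper's TFNet): frame-level features pass through a temporal branch (1D convolution over time) and a spectral branch (FFT magnitude along the time axis), and the two sequence features are fused; branch designs and sizes are illustrative assumptions.

```python
# Hedged sketch only: temporal and spectral branches over frame-level features, then fusion.
import torch
import torch.nn as nn

class TimeFrequencySketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=5, padding=2)   # temporal branch
        self.spectral = nn.Linear(dim, dim)                             # spectral branch
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frame_feats):                                     # (B, T, D) frame-level features
        temp = self.temporal(frame_feats.transpose(1, 2)).transpose(1, 2)            # (B, T, D)
        spec = torch.fft.rfft(frame_feats, dim=1).abs().mean(dim=1, keepdim=True)    # (B, 1, D) spectrum summary
        spec = self.spectral(spec).expand(-1, frame_feats.size(1), -1)               # broadcast over time
        return self.fuse(torch.cat([temp, spec], dim=-1))                            # fused (B, T, D)

print(TimeFrequencySketch()(torch.randn(2, 32, 256)).shape)  # torch.Size([2, 32, 256])
```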
{"title":"A Chinese Continuous Sign Language Dataset Based on Complex Environments","authors":"Qidan Zhu, Jing Li, Fei Yuan, Jiaojiao Fan, Quan Gan","doi":"arxiv-2409.11960","DOIUrl":"https://doi.org/arxiv-2409.11960","url":null,"abstract":"The current bottleneck in continuous sign language recognition (CSLR)\u0000research lies in the fact that most publicly available datasets are limited to\u0000laboratory environments or television program recordings, resulting in a single\u0000background environment with uniform lighting, which significantly deviates from\u0000the diversity and complexity found in real-life scenarios. To address this\u0000challenge, we have constructed a new, large-scale dataset for Chinese\u0000continuous sign language (CSL) based on complex environments, termed the\u0000complex environment - chinese sign language dataset (CE-CSL). This dataset\u0000encompasses 5,988 continuous CSL video clips collected from daily life scenes,\u0000featuring more than 70 different complex backgrounds to ensure\u0000representativeness and generalization capability. To tackle the impact of\u0000complex backgrounds on CSLR performance, we propose a time-frequency network\u0000(TFNet) model for continuous sign language recognition. This model extracts\u0000frame-level features and then utilizes both temporal and spectral information\u0000to separately derive sequence features before fusion, aiming to achieve\u0000efficient and accurate CSLR. Experimental results demonstrate that our approach\u0000achieves significant performance improvements on the CE-CSL, validating its\u0000effectiveness under complex background conditions. Additionally, our proposed\u0000method has also yielded highly competitive results when applied to three\u0000publicly available CSL datasets.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}