
arXiv - CS - Computer Vision and Pattern Recognition: Latest Publications

PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba
Pub Date: 2024-09-18 | DOI: arxiv-2409.12031
Chaoqi Luo, Yiping Xie, Zitong Yu
Facial-video-based remote photoplethysmography (rPPG) aims to measure physiological signals and monitor heart activity without any contact, showing significant potential in various applications. Previous deep-learning-based rPPG measurement methods are primarily built on CNNs and Transformers. However, the limited receptive fields of CNNs restrict their ability to capture long-range spatio-temporal dependencies, while Transformers struggle to model long video sequences due to their high computational complexity. Recently, state space models (SSMs), represented by Mamba, have shown impressive performance in capturing long-range dependencies from long sequences. In this paper, we propose PhysMamba, a Mamba-based framework that efficiently represents long-range physiological dependencies in facial videos. Specifically, we introduce the Temporal Difference Mamba block, which first enhances local dynamic differences and then models the long-range spatio-temporal context. Moreover, a dual-stream SlowFast architecture is used to fuse multi-scale temporal features. Extensive experiments on three benchmark datasets demonstrate the superiority and efficiency of PhysMamba. The code is available at https://github.com/Chaoqi31/PhysMamba
Citations: 0
Massively Multi-Person 3D Human Motion Forecasting with Scene Context
Pub Date: 2024-09-18 | DOI: arxiv-2409.12189
Felix B Mueller, Julian Tanke, Juergen Gall
Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information about the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10 s) human motion. Unlike previous models, our approach can model interactions between widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information, and we model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.
Citations: 0
ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation
Pub Date: 2024-09-18 | DOI: arxiv-2409.12010
Peiyu Li, Xiaobao Huang, Yijun Tian, Nitesh V. Chawla
Significant work has been conducted in the domain of food computing, yet these studies typically focus on single tasks such as t2t (instruction generation from food titles and ingredients), i2t (recipe generation from food images), or t2i (food image generation from recipes). None of these approaches integrates all modalities simultaneously. To address this gap, we introduce a novel food computing foundation model that achieves true multimodality, encompassing tasks such as t2t, t2i, i2t, it2t, and t2ti. By leveraging large language models (LLMs) and pre-trained image encoder and decoder models, our model can perform a diverse array of food computing tasks, including food understanding, food recognition, recipe generation, and food image generation. Compared to previous models, our foundation model demonstrates a significantly broader range of capabilities and exhibits superior performance, particularly in food image generation and recipe generation tasks. We have open-sourced ChefFusion on GitHub.
Citations: 0
Distilling Channels for Efficient Deep Tracking
Pub Date: 2024-09-18 | DOI: arxiv-2409.11785
Shiming Ge, Zhao Luo, Chunhui Zhang, Yingying Hua, Dacheng Tao
Deep trackers have proven successful in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from the massive data used in object classification, so they can represent generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework, termed channel distillation, to facilitate deep trackers. To validate its effectiveness, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem that adaptively selects informative feature channels, improving the efficacy of tracking moving objects on the fly. Channel distillation accurately extracts good channels, alleviates the influence of noisy channels, generally reduces the number of channels, and adaptively generalizes to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
Citations: 0
ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation
Pub Date: 2024-09-18 | DOI: arxiv-2409.11692
Yanlin Jin, Rui-Yang Ju, Haojun Liu, Yuzhong Zhong
Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation, named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce a cross-attention mechanism to enhance the explainability of PoseNet and reveal that the driving direction of the vehicle can be explained through attention weights, marking a novel exploration in this area. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on the KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.
Citations: 0
On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery
Pub Date: 2024-09-18 | DOI: arxiv-2409.12026
BW Sheffield, Jeffrey Ellen, Ben Whitmore
Side-scan sonar (SSS) imagery presents unique challenges for the classification of man-made objects on the seafloor due to complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or rippled sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by using a self-attention mechanism to capture global information across image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models with commonly used CNN architectures, such as ResNet and ConvNeXt, on binary classification tasks in SSS imagery. The dataset encompasses diverse seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across F1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments such as underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
Citations: 0
Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models
Pub Date: 2024-09-18 | DOI: arxiv-2409.11920
Lorenzo Mandelli, Stefano Berretti
In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach decomposes complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. The method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions and then compare its performance against the state of the art.
Citations: 0
GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation
Pub Date: 2024-09-18 | DOI: arxiv-2409.11689
Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
Pose skeleton images are an important reference for pose-controllable image generation. To enrich the sources of skeleton images, recent works have investigated generating pose skeletons from natural language, but these methods are based on GANs, and it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework for this task based on a diffusion model, and it also contains a series of variants fine-tuned from a Stable Diffusion model. PoseDiffusion demonstrates several desirable properties that outperform existing methods. 1) Correct skeletons: GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity: we decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
Citations: 0
MitoSeg: Mitochondria Segmentation Tool
Pub Date: 2024-09-18 | DOI: arxiv-2409.11974
Faris Serdar Taşel, Efe Çiftci
Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in electron microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. Automatically segmenting mitochondria from these images is crucial for investigating the relationship between mitochondria and disease. In this paper, we present a software solution for mitochondrial segmentation that highlights mitochondria boundaries in electron microscopy tomography images and generates corresponding 3D meshes.
Citations: 0
Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation
Pub Date: 2024-09-18 | DOI: arxiv-2409.11904
Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen
Efficiently evaluating the performance of text-to-image models is difficult because it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators, and we show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.
Citations: 0