Facial-video-based remote photoplethysmography (rPPG) aims to measure physiological signals and monitor heart activity without any contact, showing significant potential in various applications. Previous deep-learning-based rPPG measurement methods are primarily built on CNNs and Transformers. However, the limited receptive fields of CNNs restrict their ability to capture long-range spatio-temporal dependencies, while Transformers struggle to model long video sequences because of their high complexity. Recently, state space models (SSMs), represented by Mamba, have shown impressive performance in capturing long-range dependencies from long sequences. In this paper, we propose PhysMamba, a Mamba-based framework that efficiently represents long-range physiological dependencies from facial videos. Specifically, we introduce the Temporal Difference Mamba block, which first enhances local dynamic differences and then models the long-range spatio-temporal context. Moreover, a dual-stream SlowFast architecture is used to fuse multi-scale temporal features. Extensive experiments on three benchmark datasets demonstrate the superiority and efficiency of PhysMamba. Code is available at https://github.com/Chaoqi31/PhysMamba
{"title":"PhysMamba: Efficient Remote Physiological Measurement with SlowFast Temporal Difference Mamba","authors":"Chaoqi Luo, Yiping Xie, Zitong Yu","doi":"arxiv-2409.12031","DOIUrl":"https://doi.org/arxiv-2409.12031","url":null,"abstract":"Facial-video based Remote photoplethysmography (rPPG) aims at measuring\u0000physiological signals and monitoring heart activity without any contact,\u0000showing significant potential in various applications. Previous deep learning\u0000based rPPG measurement are primarily based on CNNs and Transformers. However,\u0000the limited receptive fields of CNNs restrict their ability to capture\u0000long-range spatio-temporal dependencies, while Transformers also struggle with\u0000modeling long video sequences with high complexity. Recently, the state space\u0000models (SSMs) represented by Mamba are known for their impressive performance\u0000on capturing long-range dependencies from long sequences. In this paper, we\u0000propose the PhysMamba, a Mamba-based framework, to efficiently represent\u0000long-range physiological dependencies from facial videos. Specifically, we\u0000introduce the Temporal Difference Mamba block to first enhance local dynamic\u0000differences and further model the long-range spatio-temporal context. Moreover,\u0000a dual-stream SlowFast architecture is utilized to fuse the multi-scale\u0000temporal features. Extensive experiments are conducted on three benchmark\u0000datasets to demonstrate the superiority and efficiency of PhysMamba. The codes\u0000are available at https://github.com/Chaoqi31/PhysMamba","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic motion from the input sequence alone. Information about the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on several metrics and in a user study. Code is available at https://github.com/felixbmuller/SAST.
{"title":"Massively Multi-Person 3D Human Motion Forecasting with Scene Context","authors":"Felix B Mueller, Julian Tanke, Juergen Gall","doi":"arxiv-2409.12189","DOIUrl":"https://doi.org/arxiv-2409.12189","url":null,"abstract":"Forecasting long-term 3D human motion is challenging: the stochasticity of\u0000human behavior makes it hard to generate realistic human motion from the input\u0000sequence alone. Information on the scene environment and the motion of nearby\u0000people can greatly aid the generation process. We propose a scene-aware social\u0000transformer model (SAST) to forecast long-term (10s) human motion motion.\u0000Unlike previous models, our approach can model interactions between both widely\u0000varying numbers of people and objects in a scene. We combine a temporal\u0000convolutional encoder-decoder architecture with a Transformer-based bottleneck\u0000that allows us to efficiently combine motion and scene information. We model\u0000the conditional motion distribution using denoising diffusion models. We\u0000benchmark our approach on the Humans in Kitchens dataset, which contains 1 to\u000016 persons and 29 to 50 objects that are visible simultaneously. Our model\u0000outperforms other approaches in terms of realism and diversity on different\u0000metrics and in a user study. Code is available at\u0000https://github.com/felixbmuller/SAST.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Significant work has been conducted in the domain of food computing, yet these studies typically focus on single tasks such as t2t (instruction generation from food titles and ingredients), i2t (recipe generation from food images), or t2i (food image generation from recipes). None of these approaches integrates all modalities simultaneously. To address this gap, we introduce a novel food computing foundation model that achieves true multimodality, encompassing tasks such as t2t, t2i, i2t, it2t, and t2ti. By leveraging large language models (LLMs) and pre-trained image encoder and decoder models, our model can perform a diverse array of food computing tasks, including food understanding, food recognition, recipe generation, and food image generation. Compared to previous models, our foundation model demonstrates a significantly broader range of capabilities and exhibits superior performance, particularly in food image generation and recipe generation. We have open-sourced ChefFusion on GitHub.
{"title":"ChefFusion: Multimodal Foundation Model Integrating Recipe and Food Image Generation","authors":"Peiyu Li, Xiaobao Huang, Yijun Tian, Nitesh V. Chawla","doi":"arxiv-2409.12010","DOIUrl":"https://doi.org/arxiv-2409.12010","url":null,"abstract":"Significant work has been conducted in the domain of food computing, yet\u0000these studies typically focus on single tasks such as t2t (instruction\u0000generation from food titles and ingredients), i2t (recipe generation from food\u0000images), or t2i (food image generation from recipes). None of these approaches\u0000integrate all modalities simultaneously. To address this gap, we introduce a\u0000novel food computing foundation model that achieves true multimodality,\u0000encompassing tasks such as t2t, t2i, i2t, it2t, and t2ti. By leveraging large\u0000language models (LLMs) and pre-trained image encoder and decoder models, our\u0000model can perform a diverse array of food computing-related tasks, including\u0000food understanding, food recognition, recipe generation, and food image\u0000generation. Compared to previous models, our foundation model demonstrates a\u0000significantly broader range of capabilities and exhibits superior performance,\u0000particularly in food image generation and recipe generation tasks. We\u0000open-sourced ChefFusion at GitHub.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep trackers have proven successful in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The networks employed are usually trained to extract rich knowledge from the massive data used in object classification, so they can represent generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework, termed channel distillation, to facilitate deep trackers. To validate its effectiveness, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem that adaptively selects informative feature channels to improve the efficacy of tracking moving objects on the fly. Channel distillation accurately extracts good channels, alleviating the influence of noisy channels and generally reducing the number of channels, while adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
{"title":"Distilling Channels for Efficient Deep Tracking","authors":"Shiming Ge, Zhao Luo, Chunhui Zhang, Yingying Hua, Dacheng Tao","doi":"arxiv-2409.11785","DOIUrl":"https://doi.org/arxiv-2409.11785","url":null,"abstract":"Deep trackers have proven success in visual tracking. Typically, these\u0000trackers employ optimally pre-trained deep networks to represent all diverse\u0000objects with multi-channel features from some fixed layers. The deep networks\u0000employed are usually trained to extract rich knowledge from massive data used\u0000in object classification and so they are capable to represent generic objects\u0000very well. However, these networks are too complex to represent a specific\u0000moving object, leading to poor generalization as well as high computational and\u0000memory costs. This paper presents a novel and general framework termed channel\u0000distillation to facilitate deep trackers. To validate the effectiveness of\u0000channel distillation, we take discriminative correlation filter (DCF) and ECO\u0000for example. We demonstrate that an integrated formulation can turn feature\u0000compression, response map generation, and model update into a unified energy\u0000minimization problem to adaptively select informative feature channels that\u0000improve the efficacy of tracking moving objects on the fly. Channel\u0000distillation can accurately extract good channels, alleviating the influence of\u0000noisy channels and generally reducing the number of channels, as well as\u0000adaptively generalizing to different channels and networks. The resulting deep\u0000tracker is accurate, fast, and has low memory requirements. Extensive\u0000experimental evaluations on popular benchmarks clearly demonstrate the\u0000effectiveness and generalizability of our framework.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep visual odometry, despite extensive research, still faces limitations in accuracy and generalizability that prevent its broader application. To address these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided visual odometry with selective online adaptation, named ORB-SfMLearner. We present a novel use of ORB features for learning-based ego-motion estimation, leading to more robust and accurate results. We also introduce a cross-attention mechanism to enhance the explainability of PoseNet, revealing that the driving direction of the vehicle can be explained through attention weights, which marks a novel exploration in this area. To improve generalizability, our selective online adaptation allows the network to rapidly and selectively adjust to the optimal parameters across different domains. Experimental results on the KITTI and vKITTI datasets show that our method outperforms previous state-of-the-art deep visual odometry methods in terms of ego-motion accuracy and generalizability.
{"title":"ORB-SfMLearner: ORB-Guided Self-supervised Visual Odometry with Selective Online Adaptation","authors":"Yanlin Jin, Rui-Yang Ju, Haojun Liu, Yuzhong Zhong","doi":"arxiv-2409.11692","DOIUrl":"https://doi.org/arxiv-2409.11692","url":null,"abstract":"Deep visual odometry, despite extensive research, still faces limitations in\u0000accuracy and generalizability that prevent its broader application. To address\u0000these challenges, we propose an Oriented FAST and Rotated BRIEF (ORB)-guided\u0000visual odometry with selective online adaptation named ORB-SfMLearner. We\u0000present a novel use of ORB features for learning-based ego-motion estimation,\u0000leading to more robust and accurate results. We also introduce the\u0000cross-attention mechanism to enhance the explainability of PoseNet and have\u0000revealed that driving direction of the vehicle can be explained through\u0000attention weights, marking a novel exploration in this area. To improve\u0000generalizability, our selective online adaptation allows the network to rapidly\u0000and selectively adjust to the optimal parameters across different domains.\u0000Experimental results on KITTI and vKITTI datasets show that our method\u0000outperforms previous state-of-the-art deep visual odometry methods in terms of\u0000ego-motion accuracy and generalizability.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Side-scan sonar (SSS) imagery presents unique challenges in the classification of man-made objects on the seafloor due to the complex and varied underwater environments. Historically, experts have manually interpreted SSS images, relying on conventional machine learning techniques with hand-crafted features. While Convolutional Neural Networks (CNNs) significantly advanced automated classification in this domain, they often fall short when dealing with diverse seafloor textures, such as rocky or ripple sand bottoms, where false positive rates may increase. Recently, Vision Transformers (ViTs) have shown potential in addressing these limitations by utilizing a self-attention mechanism to capture global information in image patches, offering more flexibility in processing spatial hierarchies. This paper rigorously compares the performance of ViT models alongside commonly used CNN architectures, such as ResNet and ConvNext, for binary classification tasks in SSS imagery. The dataset encompasses diverse geographical seafloor types and is balanced between the presence and absence of man-made objects. ViT-based models exhibit superior classification performance across f1-score, precision, recall, and accuracy metrics, although at the cost of greater computational resources. CNNs, with their inductive biases, demonstrate better computational efficiency, making them suitable for deployment in resource-constrained environments like underwater vehicles. Future research directions include exploring self-supervised learning for ViTs and multi-modal fusion to further enhance performance in challenging underwater environments.
{"title":"On Vision Transformers for Classification Tasks in Side-Scan Sonar Imagery","authors":"BW Sheffield, Jeffrey Ellen, Ben Whitmore","doi":"arxiv-2409.12026","DOIUrl":"https://doi.org/arxiv-2409.12026","url":null,"abstract":"Side-scan sonar (SSS) imagery presents unique challenges in the\u0000classification of man-made objects on the seafloor due to the complex and\u0000varied underwater environments. Historically, experts have manually interpreted\u0000SSS images, relying on conventional machine learning techniques with\u0000hand-crafted features. While Convolutional Neural Networks (CNNs) significantly\u0000advanced automated classification in this domain, they often fall short when\u0000dealing with diverse seafloor textures, such as rocky or ripple sand bottoms,\u0000where false positive rates may increase. Recently, Vision Transformers (ViTs)\u0000have shown potential in addressing these limitations by utilizing a\u0000self-attention mechanism to capture global information in image patches,\u0000offering more flexibility in processing spatial hierarchies. This paper\u0000rigorously compares the performance of ViT models alongside commonly used CNN\u0000architectures, such as ResNet and ConvNext, for binary classification tasks in\u0000SSS imagery. The dataset encompasses diverse geographical seafloor types and is\u0000balanced between the presence and absence of man-made objects. ViT-based models\u0000exhibit superior classification performance across f1-score, precision, recall,\u0000and accuracy metrics, although at the cost of greater computational resources.\u0000CNNs, with their inductive biases, demonstrate better computational efficiency,\u0000making them suitable for deployment in resource-constrained environments like\u0000underwater vehicles. Future research directions include exploring\u0000self-supervised learning for ViTs and multi-modal fusion to further enhance\u0000performance in challenging underwater environments.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state of the art.
{"title":"Generation of Complex 3D Human Motion by Temporal and Spatial Composition of Diffusion Models","authors":"Lorenzo Mandelli, Stefano Berretti","doi":"arxiv-2409.11920","DOIUrl":"https://doi.org/arxiv-2409.11920","url":null,"abstract":"In this paper, we address the challenge of generating realistic 3D human\u0000motions for action classes that were never seen during the training phase. Our\u0000approach involves decomposing complex actions into simpler movements,\u0000specifically those observed during training, by leveraging the knowledge of\u0000human motion contained in GPTs models. These simpler movements are then\u0000combined into a single, realistic animation using the properties of diffusion\u0000models. Our claim is that this decomposition and subsequent recombination of\u0000simple movements can synthesize an animation that accurately represents the\u0000complex input action. This method operates during the inference phase and can\u0000be integrated with any pre-trained diffusion model, enabling the synthesis of\u0000motion classes not present in the training data. We evaluate our method by\u0000dividing two benchmark human motion datasets into basic and complex actions,\u0000and then compare its performance against the state-of-the-art.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pose skeleton images are an important reference for pose-controllable image generation. To enrich the sources of skeleton images, recent works have investigated the generation of pose skeletons from natural language. These methods are based on GANs. However, it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from various textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned from a Stable Diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct skeletons. GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of the stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.
{"title":"GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation","authors":"Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang","doi":"arxiv-2409.11689","DOIUrl":"https://doi.org/arxiv-2409.11689","url":null,"abstract":"Pose skeleton images are an important reference in pose-controllable image\u0000generation. In order to enrich the source of skeleton images, recent works have\u0000investigated the generation of pose skeletons based on natural language. These\u0000methods are based on GANs. However, it remains challenging to perform diverse,\u0000structurally correct and aesthetically pleasing human pose skeleton generation\u0000with various textual inputs. To address this problem, we propose a framework\u0000with GUNet as the main model, PoseDiffusion. It is the first generative\u0000framework based on a diffusion model and also contains a series of variants\u0000fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates\u0000several desired properties that outperform existing methods. 1) Correct\u0000Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to\u0000incorporate graphical convolutional neural networks. It is able to learn the\u0000spatial relationships of the human skeleton by introducing skeletal information\u0000during the training process. 2) Diversity. We decouple the key points of the\u0000skeleton and characterise them separately, and use cross-attention to introduce\u0000textual conditions. Experimental results show that PoseDiffusion outperforms\u0000existing SoTA algorithms in terms of stability and diversity of text-driven\u0000pose skeleton generation. Qualitative analyses further demonstrate its\u0000superiority for controllable generation in Stable Diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent studies suggest a potential link between the physical structure of mitochondria and neurodegenerative diseases. With advances in Electron Microscopy techniques, it has become possible to visualize the boundary and internal membrane structures of mitochondria in detail. It is crucial to automatically segment mitochondria from these images to investigate the relationship between mitochondria and diseases. In this paper, we present a software solution for mitochondrial segmentation, highlighting mitochondria boundaries in electron microscopy tomography images and generating corresponding 3D meshes.
{"title":"MitoSeg: Mitochondria Segmentation Tool","authors":"Faris Serdar Taşel, Efe Çiftci","doi":"arxiv-2409.11974","DOIUrl":"https://doi.org/arxiv-2409.11974","url":null,"abstract":"Recent studies suggest a potential link between the physical structure of\u0000mitochondria and neurodegenerative diseases. With advances in Electron\u0000Microscopy techniques, it has become possible to visualize the boundary and\u0000internal membrane structures of mitochondria in detail. It is crucial to\u0000automatically segment mitochondria from these images to investigate the\u0000relationship between mitochondria and diseases. In this paper, we present a\u0000software solution for mitochondrial segmentation, highlighting mitochondria\u0000boundaries in electron microscopy tomography images and generating\u0000corresponding 3D meshes.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficiently evaluating the performance of text-to-image models is difficult as it inherently requires subjective judgment and human preference, making it hard to compare different models and quantify the state of the art. Leveraging Rapidata's technology, we present an efficient annotation framework that sources human feedback from a diverse, global pool of annotators. Our study collected over 2 million annotations across 4,512 images, evaluating four prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style preference, coherence, and text-to-image alignment. We demonstrate that our approach makes it feasible to comprehensively rank image generation models based on a vast pool of annotators and show that the diverse annotator demographics reflect the world population, significantly decreasing the risk of biases.
{"title":"Finding the Subjective Truth: Collecting 2 Million Votes for Comprehensive Gen-AI Model Evaluation","authors":"Dimitrios Christodoulou, Mads Kuhlmann-Jørgensen","doi":"arxiv-2409.11904","DOIUrl":"https://doi.org/arxiv-2409.11904","url":null,"abstract":"Efficiently evaluating the performance of text-to-image models is difficult\u0000as it inherently requires subjective judgment and human preference, making it\u0000hard to compare different models and quantify the state of the art. Leveraging\u0000Rapidata's technology, we present an efficient annotation framework that\u0000sources human feedback from a diverse, global pool of annotators. Our study\u0000collected over 2 million annotations across 4,512 images, evaluating four\u0000prominent models (DALL-E 3, Flux.1, MidJourney, and Stable Diffusion) on style\u0000preference, coherence, and text-to-image alignment. We demonstrate that our\u0000approach makes it feasible to comprehensively rank image generation models\u0000based on a vast pool of annotators and show that the diverse annotator\u0000demographics reflect the world population, significantly decreasing the risk of\u0000biases.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250572","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}