GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation (arXiv:2409.11689)
Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
Pose skeleton images are an important reference in pose-controllable image generation. To enrich the sources of skeleton images, recent works have investigated generating pose skeletons from natural language. However, these methods are GAN-based, and it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first such generative framework based on a diffusion model and also includes a series of variants fine-tuned from Stable Diffusion. PoseDiffusion demonstrates several desirable properties that surpass existing methods. 1) Correct skeletons. GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton, characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in the stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation with Stable Diffusion.
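As a rough illustration of how skeletal structure can be injected into a denoiser, the sketch below applies one graph-convolution step over per-joint features using a normalized adjacency matrix. The joint topology, feature sizes, and layer design are assumptions for illustration, not the actual GUNet architecture.

```python
import torch
import torch.nn as nn

# Hypothetical 5-joint skeleton (head, neck, l_hand, r_hand, hip) -- not the paper's topology.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5

def normalized_adjacency(num_joints, edges):
    A = torch.eye(num_joints)                 # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 (A + I) D^-1/2

class SkeletonGCNLayer(nn.Module):
    """One graph-convolution step over per-joint features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.register_buffer("A_hat", normalized_adjacency(NUM_JOINTS, EDGES))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                      # x: (batch, joints, features)
        return torch.relu(self.linear(self.A_hat @ x))

# Noisy per-joint features at some diffusion timestep, e.g. (x, y) plus a positional embedding.
noisy_joints = torch.randn(8, NUM_JOINTS, 16)
out = SkeletonGCNLayer(16, 32)(noisy_joints)
print(out.shape)                               # torch.Size([8, 5, 32])
```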
{"title":"GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation","authors":"Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang","doi":"arxiv-2409.11689","DOIUrl":"https://doi.org/arxiv-2409.11689","url":null,"abstract":"Pose skeleton images are an important reference in pose-controllable image\u0000generation. In order to enrich the source of skeleton images, recent works have\u0000investigated the generation of pose skeletons based on natural language. These\u0000methods are based on GANs. However, it remains challenging to perform diverse,\u0000structurally correct and aesthetically pleasing human pose skeleton generation\u0000with various textual inputs. To address this problem, we propose a framework\u0000with GUNet as the main model, PoseDiffusion. It is the first generative\u0000framework based on a diffusion model and also contains a series of variants\u0000fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates\u0000several desired properties that outperform existing methods. 1) Correct\u0000Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to\u0000incorporate graphical convolutional neural networks. It is able to learn the\u0000spatial relationships of the human skeleton by introducing skeletal information\u0000during the training process. 2) Diversity. We decouple the key points of the\u0000skeleton and characterise them separately, and use cross-attention to introduce\u0000textual conditions. Experimental results show that PoseDiffusion outperforms\u0000existing SoTA algorithms in terms of stability and diversity of text-driven\u0000pose skeleton generation. Qualitative analyses further demonstrate its\u0000superiority for controllable generation in Stable Diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applications of Knowledge Distillation in Remote Sensing: A Survey (arXiv:2409.12111)
Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. We first introduce the fundamental concepts and historical progression of KD methods. We then highlight the advantages of employing KD, particularly model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, in which each category is critically analyzed to demonstrate the breadth and depth of the alternatives, and presents case studies that showcase practical implementations of KD in RS tasks such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field. Through this organization, the paper not only elucidates the current state of KD research but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
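For readers new to KD, the snippet below sketches the canonical response-based distillation loss: a KL term between temperature-softened teacher and student distributions blended with the usual hard-label loss. The temperature and weighting are illustrative hyperparameters; the RS-specific variants surveyed in the paper go well beyond this baseline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Classic soft-target distillation: KL between temperature-softened
    teacher and student distributions, blended with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so gradients match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class scene-classification task.
s = torch.randn(16, 10)
t = torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
print(distillation_loss(s, t, y).item())
```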
{"title":"Applications of Knowledge Distillation in Remote Sensing: A Survey","authors":"Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad","doi":"arxiv-2409.12111","DOIUrl":"https://doi.org/arxiv-2409.12111","url":null,"abstract":"With the ever-growing complexity of models in the field of remote sensing\u0000(RS), there is an increasing demand for solutions that balance model accuracy\u0000with computational efficiency. Knowledge distillation (KD) has emerged as a\u0000powerful tool to meet this need, enabling the transfer of knowledge from large,\u0000complex models to smaller, more efficient ones without significant loss in\u0000performance. This review article provides an extensive examination of KD and\u0000its innovative applications in RS. KD, a technique developed to transfer\u0000knowledge from a complex, often cumbersome model (teacher) to a more compact\u0000and efficient model (student), has seen significant evolution and application\u0000across various domains. Initially, we introduce the fundamental concepts and\u0000historical progression of KD methods. The advantages of employing KD are\u0000highlighted, particularly in terms of model compression, enhanced computational\u0000efficiency, and improved performance, which are pivotal for practical\u0000deployments in RS scenarios. The article provides a comprehensive taxonomy of\u0000KD techniques, where each category is critically analyzed to demonstrate the\u0000breadth and depth of the alternative options, and illustrates specific case\u0000studies that showcase the practical implementation of KD methods in RS tasks,\u0000such as instance segmentation and object detection. Further, the review\u0000discusses the challenges and limitations of KD in RS, including practical\u0000constraints and prospective future directions, providing a comprehensive\u0000overview for researchers and practitioners in the field of RS. Through this\u0000organization, the paper not only elucidates the current state of research in KD\u0000but also sets the stage for future research opportunities, thereby contributing\u0000significantly to both academic research and real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency (arXiv:2409.12040)
Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen
Remote photoplethysmography (rPPG) is a non-contact method that uses facial video to predict changes in blood volume, enabling the measurement of physiological metrics. Traditional rPPG models often generalize poorly to unseen domains. Current solutions improve generalization in the target domain through Domain Generalization (DG) or Domain Adaptation (DA). However, both require access to source domain data as well as target domain data, which is impractical in scenarios with limited access to source data and raises privacy concerns. In this paper, we propose the first Source-Free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which overcomes these limitations by enabling effective domain adaptation without access to source domain data. Our framework incorporates a Three-Branch Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency across domains. Furthermore, we propose a new rPPG distribution alignment loss based on the Frequency-domain Wasserstein Distance (FWD), which leverages optimal transport to align power spectrum distributions across domains and further enforces alignment of the three branches. Extensive cross-domain experiments and ablation studies demonstrate the effectiveness of our method in source-free domain adaptation settings. Our findings highlight the significant contribution of the proposed FWD loss to distributional alignment, providing a valuable reference for future research and applications. The source code is available at https://github.com/XieYiping66/SFDA-rPPG
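The paper's exact FWD formulation is not reproduced here, but the sketch below shows the general idea under simple assumptions: treat each signal's normalized power spectrum as a 1D distribution and compute the closed-form Wasserstein-1 distance from the difference of their cumulative sums.

```python
import torch

def power_spectrum(signal, fs=30.0):
    """Normalized power spectrum of a 1D rPPG signal sampled at fs Hz."""
    spec = torch.fft.rfft(signal - signal.mean())
    power = spec.abs() ** 2
    freqs = torch.fft.rfftfreq(signal.numel(), d=1.0 / fs)
    return freqs, power / power.sum()            # treat the spectrum as a distribution

def frequency_wasserstein(sig_a, sig_b, fs=30.0):
    """1D Wasserstein-1 distance between two power spectra on the same
    frequency grid, computed from the difference of their CDFs."""
    freqs, p_a = power_spectrum(sig_a, fs)
    _, p_b = power_spectrum(sig_b, fs)
    cdf_diff = torch.cumsum(p_a - p_b, dim=0)
    df = freqs[1] - freqs[0]                      # uniform frequency spacing
    return (cdf_diff.abs() * df).sum()

# Toy example: two noisy pulse-like signals at roughly 1.2 Hz and 1.5 Hz.
t = torch.arange(0, 10, 1 / 30.0)
a = torch.sin(2 * torch.pi * 1.2 * t) + 0.1 * torch.randn_like(t)
b = torch.sin(2 * torch.pi * 1.5 * t) + 0.1 * torch.randn_like(t)
print(frequency_wasserstein(a, b).item())
```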
{"title":"SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency","authors":"Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen","doi":"arxiv-2409.12040","DOIUrl":"https://doi.org/arxiv-2409.12040","url":null,"abstract":"Remote Photoplethysmography (rPPG) is a non-contact method that uses facial\u0000video to predict changes in blood volume, enabling physiological metrics\u0000measurement. Traditional rPPG models often struggle with poor generalization\u0000capacity in unseen domains. Current solutions to this problem is to improve its\u0000generalization in the target domain through Domain Generalization (DG) or\u0000Domain Adaptation (DA). However, both traditional methods require access to\u0000both source domain data and target domain data, which cannot be implemented in\u0000scenarios with limited access to source data, and another issue is the privacy\u0000of accessing source domain data. In this paper, we propose the first\u0000Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which\u0000overcomes these limitations by enabling effective domain adaptation without\u0000access to source domain data. Our framework incorporates a Three-Branch\u0000Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency\u0000across domains. Furthermore, we propose a new rPPG distribution alignment loss\u0000based on the Frequency-domain Wasserstein Distance (FWD), which leverages\u0000optimal transport to align power spectrum distributions across domains\u0000effectively and further enforces the alignment of the three branches. Extensive\u0000cross-domain experiments and ablation studies demonstrate the effectiveness of\u0000our proposed method in source-free domain adaptation settings. Our findings\u0000highlight the significant contribution of the proposed FWD loss for\u0000distributional alignment, providing a valuable reference for future research\u0000and applications. The source code is available at\u0000https://github.com/XieYiping66/SFDA-rPPG","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Precise Forecasting of Sky Images Using Spatial Warping (arXiv:2409.12162)
Leron Julian, Aswin C. Sankaranarayanan
The intermittency of solar power, due to occlusion from cloud cover, is one of the key factors inhibiting its widespread use in both commercial and residential settings. Hence, real-time forecasting of solar irradiance for grid-connected photovoltaic systems is necessary to schedule and allocate resources across the grid. Ground-based imagers that capture wide field-of-view (FOV) images of the sky are commonly used to monitor cloud movement around a particular site in an effort to forecast solar irradiance. However, these wide-FOV imagers capture a distorted image of the sky in which regions near the horizon are heavily compressed. This hinders the ability to precisely predict cloud motion near the horizon, which especially affects prediction over longer time horizons. In this work, we address this constraint by introducing a deep learning method that predicts a future sky image frame at higher resolution than previous methods. Our main contributions are to derive an optimal warping method that counters the adverse effects of clouds at the horizon, and to learn a framework for future sky image prediction that better captures cloud evolution over longer time horizons.
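As a toy illustration of horizon-aware warping (not the optimal warp derived in the paper), the sketch below radially re-samples a wide-FOV sky image with a profile whose derivative vanishes at the image edge, so the compressed horizon band is stretched over more output pixels; the sin() profile and the OpenCV remap are assumptions.

```python
import cv2
import numpy as np

def horizon_stretch_warp(sky_img):
    """Radially re-sample a wide-FOV sky image so the compressed region near the
    horizon (image edge) occupies more output pixels. The sin() profile is an
    illustrative choice, not the optimal warp derived in the paper."""
    h, w = sky_img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cx, cy)

    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dx, dy = xs - cx, ys - cy
    r_out = np.sqrt(dx ** 2 + dy ** 2) / radius          # normalized output radius
    theta = np.arctan2(dy, dx)

    # Inverse map (output radius -> input radius): derivative goes to 0 at the edge,
    # so a thin ring of input pixels near the horizon is stretched outward.
    r_in = np.sin(np.clip(r_out, 0, 1) * np.pi / 2) * radius
    map_x = (cx + r_in * np.cos(theta)).astype(np.float32)
    map_y = (cy + r_in * np.sin(theta)).astype(np.float32)
    return cv2.remap(sky_img, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Toy usage on a synthetic frame; replace with a real all-sky image.
frame = (np.random.rand(480, 480, 3) * 255).astype(np.uint8)
print(horizon_stretch_warp(frame).shape)
```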
{"title":"Precise Forecasting of Sky Images Using Spatial Warping","authors":"Leron Julian, Aswin C. Sankaranarayanan","doi":"arxiv-2409.12162","DOIUrl":"https://doi.org/arxiv-2409.12162","url":null,"abstract":"The intermittency of solar power, due to occlusion from cloud cover, is one\u0000of the key factors inhibiting its widespread use in both commercial and\u0000residential settings. Hence, real-time forecasting of solar irradiance for\u0000grid-connected photovoltaic systems is necessary to schedule and allocate\u0000resources across the grid. Ground-based imagers that capture wide field-of-view\u0000images of the sky are commonly used to monitor cloud movement around a\u0000particular site in an effort to forecast solar irradiance. However, these wide\u0000FOV imagers capture a distorted image of sky image, where regions near the\u0000horizon are heavily compressed. This hinders the ability to precisely predict\u0000cloud motion near the horizon which especially affects prediction over longer\u0000time horizons. In this work, we combat the aforementioned constraint by\u0000introducing a deep learning method to predict a future sky image frame with\u0000higher resolution than previous methods. Our main contribution is to derive an\u0000optimal warping method to counter the adverse affects of clouds at the horizon,\u0000and learn a framework for future sky image prediction which better determines\u0000cloud evolution for longer time horizons.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Encoding for Image Recall: Human-Like Memory (arXiv:2409.11750)
Virgile Foussereau, Robin Dumas
Achieving human-like memory recall in artificial systems remains a challenging frontier in computer vision. Humans demonstrate a remarkable ability to recall images after a single exposure, even after being shown thousands of images. However, this capacity diminishes significantly when confronted with non-natural stimuli such as random textures. In this paper, we present a method inspired by human memory processes to bridge this gap between artificial and biological memory systems. Our approach focuses on encoding images to mimic the high-level information retained by the human brain, rather than storing raw pixel data. By adding noise to images before encoding, we introduce variability akin to the non-deterministic nature of human memory encoding. Leveraging the embedding layers of pre-trained models, we explore how different architectures encode images and how this affects memory recall. Our method achieves strong results, with 97% accuracy on natural images and near-random performance (52%) on textures. We provide insights into the encoding process and its implications for machine learning memory systems, shedding light on the parallels between human and artificial memory mechanisms.
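A minimal sketch of such a recall pipeline, under assumptions the paper may not share (ResNet-18 penultimate features as the stored code, Gaussian noise before encoding, and a cosine-similarity threshold for the recall decision):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Penultimate-layer features of a pre-trained CNN stand in for the "high-level
# code" held in memory; the noise level and threshold are illustrative.
backbone = models.resnet18(weights="IMAGENET1K_V1")
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def encode(image, noise_std=0.1):
    noisy = image + noise_std * torch.randn_like(image)    # stochastic encoding
    return F.normalize(encoder(noisy).flatten(1), dim=1)   # unit-norm embedding

def seen_before(query, memory, threshold=0.9):
    """Recall decision: is any stored code close enough to the query's code?"""
    sims = memory @ encode(query).T                         # cosine similarities
    return bool(sims.max() > threshold)

# Toy usage with random "images" of shape (batch, 3, 224, 224).
studied = torch.rand(5, 3, 224, 224)
memory = torch.cat([encode(img.unsqueeze(0)) for img in studied])
print(seen_before(studied[0].unsqueeze(0), memory))
```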
{"title":"Neural Encoding for Image Recall: Human-Like Memory","authors":"Virgile Foussereau, Robin Dumas","doi":"arxiv-2409.11750","DOIUrl":"https://doi.org/arxiv-2409.11750","url":null,"abstract":"Achieving human-like memory recall in artificial systems remains a\u0000challenging frontier in computer vision. Humans demonstrate remarkable ability\u0000to recall images after a single exposure, even after being shown thousands of\u0000images. However, this capacity diminishes significantly when confronted with\u0000non-natural stimuli such as random textures. In this paper, we present a method\u0000inspired by human memory processes to bridge this gap between artificial and\u0000biological memory systems. Our approach focuses on encoding images to mimic the\u0000high-level information retained by the human brain, rather than storing raw\u0000pixel data. By adding noise to images before encoding, we introduce variability\u0000akin to the non-deterministic nature of human memory encoding. Leveraging\u0000pre-trained models' embedding layers, we explore how different architectures\u0000encode images and their impact on memory recall. Our method achieves impressive\u0000results, with 97% accuracy on natural images and near-random performance (52%)\u0000on textures. We provide insights into the encoding process and its implications\u0000for machine learning memory systems, shedding light on the parallels between\u0000human and artificial intelligence memory mechanisms.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SymFace: Additional Facial Symmetry Loss for Deep Face Recognition (arXiv:2409.11816)
Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran
Over the past decade, there has been steady progress in improving face recognition algorithms with advanced machine learning methods. The loss function plays a pivotal, often game-changing role in face verification. Existing loss functions have mainly explored variations in intra-class or inter-class separation. This research examines the natural phenomenon of facial symmetry in the face verification problem. The symmetry between the left and right hemi-faces has been widely used in many research areas in recent decades. This paper adopts this simple idea judiciously by splitting the face image vertically into two halves. Assuming that facial symmetry can enhance face verification, we hypothesize that the two output embedding vectors of the split faces must project close to each other in the output embedding space. Inspired by this concept, we penalize the network based on the disparity between the embeddings of the symmetrical pair of split faces. The symmetry loss can suppress minor asymmetric features caused by facial expression and lighting conditions, thereby significantly increasing the inter-class variance among the classes and leading to more reliable face embeddings. This loss function propels any network to outperform its baseline across existing network architectures and configurations, enabling us to achieve SoTA results.
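A minimal sketch of such a symmetry penalty, assuming the right half is mirrored before encoding and cosine distance is used; the stand-in encoder and the 0.1 weight are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def symmetry_loss(face_batch, encoder):
    """Penalize the distance between embeddings of the two vertical halves of a
    face; mirroring the right half and using cosine distance are assumptions."""
    w = face_batch.shape[-1]
    left = face_batch[..., : w // 2]
    right = torch.flip(face_batch[..., w - w // 2 :], dims=[-1])   # mirror to align
    z_left = F.normalize(encoder(left), dim=1)
    z_right = F.normalize(encoder(right), dim=1)
    return (1.0 - (z_left * z_right).sum(dim=1)).mean()            # 1 - cosine

# Toy usage: a tiny stand-in encoder and a random batch of 112x112 face crops.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(128))
faces = torch.rand(4, 3, 112, 112)
base_loss = torch.tensor(0.0)                  # placeholder for the usual margin-based loss
total_loss = base_loss + 0.1 * symmetry_loss(faces, encoder)
print(total_loss.item())
```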
{"title":"SymFace: Additional Facial Symmetry Loss for Deep Face Recognition","authors":"Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran","doi":"arxiv-2409.11816","DOIUrl":"https://doi.org/arxiv-2409.11816","url":null,"abstract":"Over the past decade, there has been a steady advancement in enhancing face\u0000recognition algorithms leveraging advanced machine learning methods. The role\u0000of the loss function is pivotal in addressing face verification problems and\u0000playing a game-changing role. These loss functions have mainly explored\u0000variations among intra-class or inter-class separation. This research examines\u0000the natural phenomenon of facial symmetry in the face verification problem. The\u0000symmetry between the left and right hemi faces has been widely used in many\u0000research areas in recent decades. This paper adopts this simple approach\u0000judiciously by splitting the face image vertically into two halves. With the\u0000assumption that the natural phenomena of facial symmetry can enhance face\u0000verification methodology, we hypothesize that the two output embedding vectors\u0000of split faces must project close to each other in the output embedding space.\u0000Inspired by this concept, we penalize the network based on the disparity of\u0000embedding of the symmetrical pair of split faces. Symmetrical loss has the\u0000potential to minimize minor asymmetric features due to facial expression and\u0000lightning conditions, hence significantly increasing the inter-class variance\u0000among the classes and leading to more reliable face embedding. This loss\u0000function propels any network to outperform its baseline performance across all\u0000existing network architectures and configurations, enabling us to achieve SoTA\u0000results.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (arXiv:2409.11969)
Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li
End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms reduces development efficacy and impedes the establishment of trust. As a first attempt at this issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that compares a module's feature maps against the ground truth within a unified semantic representation space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework is the process of feature-map encoding and representation alignment, carried out by our proposed two-stage Alignment AutoEncoder, which preserves salient information and keeps the feature structure consistent. The metric for evaluating the training maturity of functional modules, the Similarity Score, shows a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
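A toy version of the similarity computation, with simple convolutional encoders standing in for the two-stage Alignment AutoEncoder and cosine similarity as the score; the layer sizes and inputs are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentEncoder(nn.Module):
    """Toy encoder mapping a BEV feature map (or a rasterized ground truth) into a
    shared representation space; the real framework uses a two-stage alignment
    autoencoder, so the layers here are illustrative."""
    def __init__(self, in_ch, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

def similarity_score(module_feat, gt_raster, enc_feat, enc_gt):
    """Cosine similarity between aligned representations; higher is taken to
    indicate a more mature functional module."""
    return (enc_feat(module_feat) * enc_gt(gt_raster)).sum(dim=1).mean()

# Toy usage: a 64-channel module feature map vs. a 1-channel GT occupancy raster.
score = similarity_score(torch.rand(2, 64, 100, 100), torch.rand(2, 1, 100, 100),
                         AlignmentEncoder(64), AlignmentEncoder(1))
print(score.item())
```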
{"title":"Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model","authors":"Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li","doi":"arxiv-2409.11969","DOIUrl":"https://doi.org/arxiv-2409.11969","url":null,"abstract":"End-to-end models are emerging as the mainstream in autonomous driving\u0000perception. However, the inability to meticulously deconstruct their internal\u0000mechanisms results in diminished development efficacy and impedes the\u0000establishment of trust. Pioneering in the issue, we present the Independent\u0000Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a\u0000novel framework that juxtaposes the module's feature maps against Ground Truth\u0000within a unified semantic Representation Space to quantify their similarity,\u0000thereby assessing the training maturity of individual functional modules. The\u0000core of the framework lies in the process of feature map encoding and\u0000representation aligning, facilitated by our proposed two-stage Alignment\u0000AutoEncoder, which ensures the preservation of salient information and the\u0000consistency of feature structure. The metric for evaluating the training\u0000maturity of functional modules, Similarity Score, demonstrates a robust\u0000positive correlation with BEV metrics, with an average correlation coefficient\u0000of 0.9387, attesting to the framework's reliability for assessment purposes.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning (arXiv:2409.11813)
Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang
The event camera has achieved significant success across a wide range of areas due to its low latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in overfitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work addresses this gap by introducing a systematic augmentation scheme, EventAug, to enrich spatio-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce the Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. EventAug lets models learn from richer motion patterns, object variants, and local spatio-temporal relations, improving robustness to varied moving speeds, occlusions, and action disruptions. Experimental results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be made publicly available to the community.
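The sketch below shows simplified stand-ins for these ideas on a raw (x, y, t, polarity) event stream: timestamp rescaling to vary apparent motion speed, and a random spatio-temporal mask that drops events inside a window. The actual MSTI/SSEM/TSEM operators in the paper are saliency-driven and more involved.

```python
import numpy as np

def temporal_scale(events, factor):
    """Speed up or slow down apparent motion by rescaling timestamps
    (a simplified stand-in for multi-scale temporal integration)."""
    out = events.copy()
    out[:, 2] = (out[:, 2] - out[:, 2].min()) * factor
    return out

def random_spatiotemporal_mask(events, sensor_hw, rng=None):
    """Drop events inside a random spatial window over a random time slice
    (a simplified stand-in for the spatial/temporal salient event masks)."""
    rng = rng or np.random.default_rng()
    h, w = sensor_hw
    mh, mw = int(h * 0.3), int(w * 0.3)
    y0, x0 = rng.integers(0, h - mh), rng.integers(0, w - mw)
    t_lo, t_hi = np.quantile(events[:, 2], sorted(rng.uniform(0, 1, 2)))
    inside = ((events[:, 0] >= x0) & (events[:, 0] < x0 + mw) &
              (events[:, 1] >= y0) & (events[:, 1] < y0 + mh) &
              (events[:, 2] >= t_lo) & (events[:, 2] <= t_hi))
    return events[~inside]

# Toy event stream: columns are (x, y, t, polarity) on a 128x128 sensor.
ev = np.column_stack([np.random.randint(0, 128, 10000),
                      np.random.randint(0, 128, 10000),
                      np.sort(np.random.rand(10000)),
                      np.random.choice([-1, 1], 10000)]).astype(np.float32)
aug = random_spatiotemporal_mask(temporal_scale(ev, 0.5), (128, 128))
print(ev.shape, aug.shape)
```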
{"title":"EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning","authors":"Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang","doi":"arxiv-2409.11813","DOIUrl":"https://doi.org/arxiv-2409.11813","url":null,"abstract":"The event camera has demonstrated significant success across a wide range of\u0000areas due to its low time latency and high dynamic range. However, the\u0000community faces challenges such as data deficiency and limited diversity, often\u0000resulting in over-fitting and inadequate feature learning. Notably, the\u0000exploration of data augmentation techniques in the event community remains\u0000scarce. This work aims to address this gap by introducing a systematic\u0000augmentation scheme named EventAug to enrich spatial-temporal diversity. In\u0000particular, we first propose Multi-scale Temporal Integration (MSTI) to\u0000diversify the motion speed of objects, then introduce Spatial-salient Event\u0000Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants.\u0000Our EventAug can facilitate models learning with richer motion patterns, object\u0000variants and local spatio-temporal relations, thus improving model robustness\u0000to varied moving speeds, occlusions, and action disruptions. Experiment results\u0000show that our augmentation method consistently yields significant improvements\u0000across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128\u0000Gesture). Our code will be publicly available for this community.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
We present the Qwen2-VL series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude 3.5 Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
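To make the dynamic-resolution idea concrete, the sketch below estimates the visual-token budget for images of different sizes, assuming patch-based tokenization followed by a 2x2 token merge; the patch size and merge factor are illustrative assumptions rather than the model's exact configuration.

```python
import math

def visual_token_count(height, width, patch=14, merge=2):
    """Rough token budget under a dynamic-resolution scheme: the image is cut
    into patch x patch tiles and adjacent merge x merge tokens are fused.
    Patch size and merge factor are illustrative assumptions."""
    h_patches = math.ceil(height / patch)
    w_patches = math.ceil(width / patch)
    return math.ceil(h_patches / merge) * math.ceil(w_patches / merge)

# The same model spends very different budgets on different inputs.
for h, w in [(224, 224), (448, 896), (1080, 1920)]:
    print(f"{h}x{w} -> {visual_token_count(h, w)} visual tokens")
```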
{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","authors":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12191","DOIUrl":"https://doi.org/arxiv-2409.12191","url":null,"abstract":"We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL\u0000models that redefines the conventional predetermined-resolution approach in\u0000visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,\u0000which enables the model to dynamically process images of varying resolutions\u0000into different numbers of visual tokens. This approach allows the model to\u0000generate more efficient and accurate visual representations, closely aligning\u0000with human perceptual processes. The model also integrates Multimodal Rotary\u0000Position Embedding (M-RoPE), facilitating the effective fusion of positional\u0000information across text, images, and videos. We employ a unified paradigm for\u0000processing both images and videos, enhancing the model's visual perception\u0000capabilities. To explore the potential of large multimodal models, Qwen2-VL\u0000investigates the scaling laws for large vision-language models (LVLMs). By\u0000scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the\u0000amount of training data, the Qwen2-VL Series achieves highly competitive\u0000performance. Notably, the Qwen2-VL-72B model achieves results comparable to\u0000leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal\u0000benchmarks, outperforming other generalist models. Code is available at\u0000url{https://github.com/QwenLM/Qwen2-VL}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intraoperative Registration by Cross-Modal Inverse Neural Rendering (arXiv:2409.11983)
Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine
In this paper, we present a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates the implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. This disentanglement is achieved by controlling a Neural Radiance Field's appearance with a multi-style hypernetwork. Once trained, the implicit neural representation serves as a differentiable rendering engine that can be used to estimate the surgical camera pose by minimizing the dissimilarity between its rendered images and the target intraoperative image. We tested our method on retrospective patient data from clinical cases, showing that it outperforms the state of the art while meeting current clinical standards for registration. Code and additional resources can be found at https://maxfehrentz.github.io/style-ngp/.
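A generic sketch of pose estimation by inverse rendering, with a hypothetical differentiable render(pose) standing in for the trained implicit representation and an L1 photometric dissimilarity as an illustrative choice:

```python
import torch

def register_camera(render, target, pose_init, steps=200, lr=1e-2):
    """Generic pose-by-inverse-rendering loop: `render` is assumed to be a
    differentiable function mapping a 6-DoF pose vector to an image; the L1
    photometric dissimilarity is an illustrative choice."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (render(pose) - target).abs().mean()
        loss.backward()
        opt.step()
    return pose.detach()

# Toy usage with a fake differentiable "renderer" that modulates a gradient image.
target_pose = torch.tensor([0.3, -0.2, 0.1, 0.0, 0.0, 0.0])
base = torch.linspace(0, 1, 64).repeat(64, 1)
fake_render = lambda p: base * (1 + p[:3].sum()) + p[3:].sum()
target_img = fake_render(target_pose)
est = register_camera(fake_render, target_img, torch.zeros(6))
print(f"final photometric error: {(fake_render(est) - target_img).abs().mean().item():.4f}")
```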
{"title":"Intraoperative Registration by Cross-Modal Inverse Neural Rendering","authors":"Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine","doi":"arxiv-2409.11983","DOIUrl":"https://doi.org/arxiv-2409.11983","url":null,"abstract":"We present in this paper a novel approach for 3D/2D intraoperative\u0000registration during neurosurgery via cross-modal inverse neural rendering. Our\u0000approach separates implicit neural representation into two components, handling\u0000anatomical structure preoperatively and appearance intraoperatively. This\u0000disentanglement is achieved by controlling a Neural Radiance Field's appearance\u0000with a multi-style hypernetwork. Once trained, the implicit neural\u0000representation serves as a differentiable rendering engine, which can be used\u0000to estimate the surgical camera pose by minimizing the dissimilarity between\u0000its rendered images and the target intraoperative image. We tested our method\u0000on retrospective patients' data from clinical cases, showing that our method\u0000outperforms state-of-the-art while meeting current clinical standards for\u0000registration. Code and additional resources can be found at\u0000https://maxfehrentz.github.io/style-ngp/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}