SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency
Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen
arXiv:2409.12040, 2024-09-18

Remote Photoplethysmography (rPPG) is a non-contact method that uses facial video to predict changes in blood volume, enabling the measurement of physiological metrics. Traditional rPPG models often struggle with poor generalization in unseen domains. Current solutions to this problem improve generalization to the target domain through Domain Generalization (DG) or Domain Adaptation (DA). However, both approaches require access to source domain data as well as target domain data, which is impractical in scenarios with limited access to source data and raises privacy concerns about the source domain. In this paper, we propose the first Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which overcomes these limitations by enabling effective domain adaptation without access to source domain data. Our framework incorporates a Three-Branch Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency across domains. Furthermore, we propose a new rPPG distribution alignment loss based on the Frequency-domain Wasserstein Distance (FWD), which leverages optimal transport to align power spectrum distributions across domains effectively and further enforces the alignment of the three branches. Extensive cross-domain experiments and ablation studies demonstrate the effectiveness of our proposed method in source-free domain adaptation settings. Our findings highlight the significant contribution of the proposed FWD loss to distributional alignment, providing a valuable reference for future research and applications. The source code is available at https://github.com/XieYiping66/SFDA-rPPG.
{"title":"SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency","authors":"Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen","doi":"arxiv-2409.12040","DOIUrl":"https://doi.org/arxiv-2409.12040","url":null,"abstract":"Remote Photoplethysmography (rPPG) is a non-contact method that uses facial\u0000video to predict changes in blood volume, enabling physiological metrics\u0000measurement. Traditional rPPG models often struggle with poor generalization\u0000capacity in unseen domains. Current solutions to this problem is to improve its\u0000generalization in the target domain through Domain Generalization (DG) or\u0000Domain Adaptation (DA). However, both traditional methods require access to\u0000both source domain data and target domain data, which cannot be implemented in\u0000scenarios with limited access to source data, and another issue is the privacy\u0000of accessing source domain data. In this paper, we propose the first\u0000Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which\u0000overcomes these limitations by enabling effective domain adaptation without\u0000access to source domain data. Our framework incorporates a Three-Branch\u0000Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency\u0000across domains. Furthermore, we propose a new rPPG distribution alignment loss\u0000based on the Frequency-domain Wasserstein Distance (FWD), which leverages\u0000optimal transport to align power spectrum distributions across domains\u0000effectively and further enforces the alignment of the three branches. Extensive\u0000cross-domain experiments and ablation studies demonstrate the effectiveness of\u0000our proposed method in source-free domain adaptation settings. Our findings\u0000highlight the significant contribution of the proposed FWD loss for\u0000distributional alignment, providing a valuable reference for future research\u0000and applications. The source code is available at\u0000https://github.com/XieYiping66/SFDA-rPPG","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applications of Knowledge Distillation in Remote Sensing: A Survey
Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
arXiv:2409.12111, 2024-09-18
With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
{"title":"Applications of Knowledge Distillation in Remote Sensing: A Survey","authors":"Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad","doi":"arxiv-2409.12111","DOIUrl":"https://doi.org/arxiv-2409.12111","url":null,"abstract":"With the ever-growing complexity of models in the field of remote sensing\u0000(RS), there is an increasing demand for solutions that balance model accuracy\u0000with computational efficiency. Knowledge distillation (KD) has emerged as a\u0000powerful tool to meet this need, enabling the transfer of knowledge from large,\u0000complex models to smaller, more efficient ones without significant loss in\u0000performance. This review article provides an extensive examination of KD and\u0000its innovative applications in RS. KD, a technique developed to transfer\u0000knowledge from a complex, often cumbersome model (teacher) to a more compact\u0000and efficient model (student), has seen significant evolution and application\u0000across various domains. Initially, we introduce the fundamental concepts and\u0000historical progression of KD methods. The advantages of employing KD are\u0000highlighted, particularly in terms of model compression, enhanced computational\u0000efficiency, and improved performance, which are pivotal for practical\u0000deployments in RS scenarios. The article provides a comprehensive taxonomy of\u0000KD techniques, where each category is critically analyzed to demonstrate the\u0000breadth and depth of the alternative options, and illustrates specific case\u0000studies that showcase the practical implementation of KD methods in RS tasks,\u0000such as instance segmentation and object detection. Further, the review\u0000discusses the challenges and limitations of KD in RS, including practical\u0000constraints and prospective future directions, providing a comprehensive\u0000overview for researchers and practitioners in the field of RS. Through this\u0000organization, the paper not only elucidates the current state of research in KD\u0000but also sets the stage for future research opportunities, thereby contributing\u0000significantly to both academic research and real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EFCM: Efficient Fine-tuning on Compressed Models for deployment of large models in medical image analysis
Shaojie Li, Zhaoshuo Diao
arXiv:2409.11817, 2024-09-18

Recent large deep learning models in medicine show remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference latency challenges. Knowledge distillation offers a solution, but slide-level gradients cannot be backpropagated to update the student model because pathological images are high-resolution and only slide-level labels are available. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive field adjustment to enhance the knowledge absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.
{"title":"EFCM: Efficient Fine-tuning on Compressed Models for deployment of large models in medical image analysis","authors":"Shaojie Li, Zhaoshuo Diao","doi":"arxiv-2409.11817","DOIUrl":"https://doi.org/arxiv-2409.11817","url":null,"abstract":"The recent development of deep learning large models in medicine shows\u0000remarkable performance in medical image analysis and diagnosis, but their large\u0000number of parameters causes memory and inference latency challenges. Knowledge\u0000distillation offers a solution, but the slide-level gradients cannot be\u0000backpropagated for student model updates due to high-resolution pathological\u0000images and slide-level labels. This study presents an Efficient Fine-tuning on\u0000Compressed Models (EFCM) framework with two stages: unsupervised feature\u0000distillation and fine-tuning. In the distillation stage, Feature Projection\u0000Distillation (FPD) is proposed with a TransScan module for adaptive receptive\u0000field adjustment to enhance the knowledge absorption capability of the student\u0000model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM,\u0000Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are\u0000conducted on 11 downstream datasets related to three large medical models:\u0000RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The\u0000experimental results demonstrate that the EFCM framework significantly improves\u0000accuracy and efficiency in handling slide-level pathological image problems,\u0000effectively addressing the challenges of deploying large medical models.\u0000Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC\u0000compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The\u0000analysis of model inference efficiency highlights the high efficiency of the\u0000distillation fine-tuning method.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"15 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Precise Forecasting of Sky Images Using Spatial Warping
Leron Julian, Aswin C. Sankaranarayanan
arXiv:2409.12162, 2024-09-18

The intermittency of solar power, due to occlusion from cloud cover, is one of the key factors inhibiting its widespread use in both commercial and residential settings. Hence, real-time forecasting of solar irradiance for grid-connected photovoltaic systems is necessary to schedule and allocate resources across the grid. Ground-based imagers that capture wide field-of-view (FOV) images of the sky are commonly used to monitor cloud movement around a particular site in an effort to forecast solar irradiance. However, these wide-FOV imagers capture a distorted image of the sky, in which regions near the horizon are heavily compressed. This hinders the ability to precisely predict cloud motion near the horizon, which especially affects prediction over longer time horizons. In this work, we address this constraint by introducing a deep learning method to predict a future sky image frame at higher resolution than previous methods. Our main contribution is to derive an optimal warping method that counters the adverse effects of clouds at the horizon, and to learn a framework for future sky image prediction that better determines cloud evolution over longer time horizons.
{"title":"Precise Forecasting of Sky Images Using Spatial Warping","authors":"Leron Julian, Aswin C. Sankaranarayanan","doi":"arxiv-2409.12162","DOIUrl":"https://doi.org/arxiv-2409.12162","url":null,"abstract":"The intermittency of solar power, due to occlusion from cloud cover, is one\u0000of the key factors inhibiting its widespread use in both commercial and\u0000residential settings. Hence, real-time forecasting of solar irradiance for\u0000grid-connected photovoltaic systems is necessary to schedule and allocate\u0000resources across the grid. Ground-based imagers that capture wide field-of-view\u0000images of the sky are commonly used to monitor cloud movement around a\u0000particular site in an effort to forecast solar irradiance. However, these wide\u0000FOV imagers capture a distorted image of sky image, where regions near the\u0000horizon are heavily compressed. This hinders the ability to precisely predict\u0000cloud motion near the horizon which especially affects prediction over longer\u0000time horizons. In this work, we combat the aforementioned constraint by\u0000introducing a deep learning method to predict a future sky image frame with\u0000higher resolution than previous methods. Our main contribution is to derive an\u0000optimal warping method to counter the adverse affects of clouds at the horizon,\u0000and learn a framework for future sky image prediction which better determines\u0000cloud evolution for longer time horizons.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Encoding for Image Recall: Human-Like Memory
Virgile Foussereau, Robin Dumas
arXiv:2409.11750, 2024-09-18

Achieving human-like memory recall in artificial systems remains a challenging frontier in computer vision. Humans demonstrate a remarkable ability to recall images after a single exposure, even after being shown thousands of images. However, this capacity diminishes significantly when confronted with non-natural stimuli such as random textures. In this paper, we present a method inspired by human memory processes to bridge this gap between artificial and biological memory systems. Our approach focuses on encoding images to mimic the high-level information retained by the human brain, rather than storing raw pixel data. By adding noise to images before encoding, we introduce variability akin to the non-deterministic nature of human memory encoding. Leveraging pre-trained models' embedding layers, we explore how different architectures encode images and how this affects memory recall. Our method achieves impressive results, with 97% accuracy on natural images and near-random performance (52%) on textures. We provide insights into the encoding process and its implications for machine learning memory systems, shedding light on the parallels between human and artificial intelligence memory mechanisms.
{"title":"Neural Encoding for Image Recall: Human-Like Memory","authors":"Virgile Foussereau, Robin Dumas","doi":"arxiv-2409.11750","DOIUrl":"https://doi.org/arxiv-2409.11750","url":null,"abstract":"Achieving human-like memory recall in artificial systems remains a\u0000challenging frontier in computer vision. Humans demonstrate remarkable ability\u0000to recall images after a single exposure, even after being shown thousands of\u0000images. However, this capacity diminishes significantly when confronted with\u0000non-natural stimuli such as random textures. In this paper, we present a method\u0000inspired by human memory processes to bridge this gap between artificial and\u0000biological memory systems. Our approach focuses on encoding images to mimic the\u0000high-level information retained by the human brain, rather than storing raw\u0000pixel data. By adding noise to images before encoding, we introduce variability\u0000akin to the non-deterministic nature of human memory encoding. Leveraging\u0000pre-trained models' embedding layers, we explore how different architectures\u0000encode images and their impact on memory recall. Our method achieves impressive\u0000results, with 97% accuracy on natural images and near-random performance (52%)\u0000on textures. We provide insights into the encoding process and its implications\u0000for machine learning memory systems, shedding light on the parallels between\u0000human and artificial intelligence memory mechanisms.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SymFace: Additional Facial Symmetry Loss for Deep Face Recognition
Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran
arXiv:2409.11816, 2024-09-18
Over the past decade, face recognition algorithms have advanced steadily by leveraging modern machine learning methods. The loss function plays a pivotal, often game-changing role in face verification, and existing losses have mainly explored variations in intra-class or inter-class separation. This research examines the natural phenomenon of facial symmetry in the face verification problem. The symmetry between the left and right hemifaces has been widely used in many research areas in recent decades. This paper adopts this simple approach judiciously by splitting the face image vertically into two halves. Assuming that facial symmetry can enhance face verification, we hypothesize that the two embedding vectors of the split faces should project close to each other in the output embedding space. Inspired by this concept, we penalize the network based on the disparity between the embeddings of the symmetrical pair of split faces. The symmetry loss can suppress minor asymmetric features caused by facial expression and lighting conditions, significantly increasing the inter-class variance among classes and leading to more reliable face embeddings. This loss function propels a network to outperform its baseline performance across all the network architectures and configurations evaluated, enabling us to achieve SoTA results.
{"title":"SymFace: Additional Facial Symmetry Loss for Deep Face Recognition","authors":"Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran","doi":"arxiv-2409.11816","DOIUrl":"https://doi.org/arxiv-2409.11816","url":null,"abstract":"Over the past decade, there has been a steady advancement in enhancing face\u0000recognition algorithms leveraging advanced machine learning methods. The role\u0000of the loss function is pivotal in addressing face verification problems and\u0000playing a game-changing role. These loss functions have mainly explored\u0000variations among intra-class or inter-class separation. This research examines\u0000the natural phenomenon of facial symmetry in the face verification problem. The\u0000symmetry between the left and right hemi faces has been widely used in many\u0000research areas in recent decades. This paper adopts this simple approach\u0000judiciously by splitting the face image vertically into two halves. With the\u0000assumption that the natural phenomena of facial symmetry can enhance face\u0000verification methodology, we hypothesize that the two output embedding vectors\u0000of split faces must project close to each other in the output embedding space.\u0000Inspired by this concept, we penalize the network based on the disparity of\u0000embedding of the symmetrical pair of split faces. Symmetrical loss has the\u0000potential to minimize minor asymmetric features due to facial expression and\u0000lightning conditions, hence significantly increasing the inter-class variance\u0000among the classes and leading to more reliable face embedding. This loss\u0000function propels any network to outperform its baseline performance across all\u0000existing network architectures and configurations, enabling us to achieve SoTA\u0000results.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model
Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li
arXiv:2409.11969, 2024-09-18
End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms reduces development efficacy and impedes the establishment of trust. As a first effort on this issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that juxtaposes a module's feature maps against ground truth within a unified semantic representation space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework lies in encoding feature maps and aligning representations, facilitated by our proposed two-stage Alignment AutoEncoder, which preserves salient information and maintains the consistency of feature structure. The metric for evaluating the training maturity of functional modules, the Similarity Score, demonstrates a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
{"title":"Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model","authors":"Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li","doi":"arxiv-2409.11969","DOIUrl":"https://doi.org/arxiv-2409.11969","url":null,"abstract":"End-to-end models are emerging as the mainstream in autonomous driving\u0000perception. However, the inability to meticulously deconstruct their internal\u0000mechanisms results in diminished development efficacy and impedes the\u0000establishment of trust. Pioneering in the issue, we present the Independent\u0000Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a\u0000novel framework that juxtaposes the module's feature maps against Ground Truth\u0000within a unified semantic Representation Space to quantify their similarity,\u0000thereby assessing the training maturity of individual functional modules. The\u0000core of the framework lies in the process of feature map encoding and\u0000representation aligning, facilitated by our proposed two-stage Alignment\u0000AutoEncoder, which ensures the preservation of salient information and the\u0000consistency of feature structure. The metric for evaluating the training\u0000maturity of functional modules, Similarity Score, demonstrates a robust\u0000positive correlation with BEV metrics, with an average correlation coefficient\u0000of 0.9387, attesting to the framework's reliability for assessment purposes.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning
Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang
arXiv:2409.11813, 2024-09-18

The event camera has demonstrated significant success across a wide range of areas due to its low time latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in over-fitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work aims to address this gap by introducing a systematic augmentation scheme named EventAug to enrich spatio-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce the Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. EventAug can facilitate model learning with richer motion patterns, object variants, and local spatio-temporal relations, thus improving model robustness to varied moving speeds, occlusions, and action disruptions. Experimental results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be made publicly available.
{"title":"EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning","authors":"Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang","doi":"arxiv-2409.11813","DOIUrl":"https://doi.org/arxiv-2409.11813","url":null,"abstract":"The event camera has demonstrated significant success across a wide range of\u0000areas due to its low time latency and high dynamic range. However, the\u0000community faces challenges such as data deficiency and limited diversity, often\u0000resulting in over-fitting and inadequate feature learning. Notably, the\u0000exploration of data augmentation techniques in the event community remains\u0000scarce. This work aims to address this gap by introducing a systematic\u0000augmentation scheme named EventAug to enrich spatial-temporal diversity. In\u0000particular, we first propose Multi-scale Temporal Integration (MSTI) to\u0000diversify the motion speed of objects, then introduce Spatial-salient Event\u0000Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants.\u0000Our EventAug can facilitate models learning with richer motion patterns, object\u0000variants and local spatio-temporal relations, thus improving model robustness\u0000to varied moving speeds, occlusions, and action disruptions. Experiment results\u0000show that our augmentation method consistently yields significant improvements\u0000across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128\u0000Gesture). Our code will be publicly available for this community.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
arXiv:2409.12191, 2024-09-18
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","authors":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12191","DOIUrl":"https://doi.org/arxiv-2409.12191","url":null,"abstract":"We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL\u0000models that redefines the conventional predetermined-resolution approach in\u0000visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,\u0000which enables the model to dynamically process images of varying resolutions\u0000into different numbers of visual tokens. This approach allows the model to\u0000generate more efficient and accurate visual representations, closely aligning\u0000with human perceptual processes. The model also integrates Multimodal Rotary\u0000Position Embedding (M-RoPE), facilitating the effective fusion of positional\u0000information across text, images, and videos. We employ a unified paradigm for\u0000processing both images and videos, enhancing the model's visual perception\u0000capabilities. To explore the potential of large multimodal models, Qwen2-VL\u0000investigates the scaling laws for large vision-language models (LVLMs). By\u0000scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the\u0000amount of training data, the Qwen2-VL Series achieves highly competitive\u0000performance. Notably, the Qwen2-VL-72B model achieves results comparable to\u0000leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal\u0000benchmarks, outperforming other generalist models. Code is available at\u0000url{https://github.com/QwenLM/Qwen2-VL}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intraoperative Registration by Cross-Modal Inverse Neural Rendering
Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine
arXiv:2409.11983, 2024-09-18
We present in this paper a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates the implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. This disentanglement is achieved by controlling a Neural Radiance Field's appearance with a multi-style hypernetwork. Once trained, the implicit neural representation serves as a differentiable rendering engine, which can be used to estimate the surgical camera pose by minimizing the dissimilarity between its rendered images and the target intraoperative image. We tested our method on retrospective patients' data from clinical cases, showing that our method outperforms the state of the art while meeting current clinical standards for registration. Code and additional resources can be found at https://maxfehrentz.github.io/style-ngp/.
{"title":"Intraoperative Registration by Cross-Modal Inverse Neural Rendering","authors":"Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine","doi":"arxiv-2409.11983","DOIUrl":"https://doi.org/arxiv-2409.11983","url":null,"abstract":"We present in this paper a novel approach for 3D/2D intraoperative\u0000registration during neurosurgery via cross-modal inverse neural rendering. Our\u0000approach separates implicit neural representation into two components, handling\u0000anatomical structure preoperatively and appearance intraoperatively. This\u0000disentanglement is achieved by controlling a Neural Radiance Field's appearance\u0000with a multi-style hypernetwork. Once trained, the implicit neural\u0000representation serves as a differentiable rendering engine, which can be used\u0000to estimate the surgical camera pose by minimizing the dissimilarity between\u0000its rendered images and the target intraoperative image. We tested our method\u0000on retrospective patients' data from clinical cases, showing that our method\u0000outperforms state-of-the-art while meeting current clinical standards for\u0000registration. Code and additional resources can be found at\u0000https://maxfehrentz.github.io/style-ngp/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}