GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation (arXiv:2409.11689)
Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang
Pose skeleton images are an important reference in pose-controllable image generation. To enrich the sources of skeleton images, recent works have investigated generating pose skeletons from natural language. However, these methods are GAN-based, and it remains challenging to generate diverse, structurally correct, and aesthetically pleasing human pose skeletons from varied textual inputs. To address this problem, we propose PoseDiffusion, a framework with GUNet as its main model. It is the first such generative framework based on a diffusion model and also includes a series of variants fine-tuned from Stable Diffusion. PoseDiffusion demonstrates several desirable properties that surpass existing methods. 1) Correct skeletons. GUNet, the denoising model of PoseDiffusion, incorporates graph convolutional neural networks and learns the spatial relationships of the human skeleton by introducing skeletal information during training. 2) Diversity. We decouple the key points of the skeleton, characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in the stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation with Stable Diffusion.
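As a rough illustration of how skeletal structure can be injected into a denoiser, the sketch below applies one graph-convolution step over per-joint features using a normalized adjacency matrix. The joint topology, feature sizes, and layer design are assumptions for illustration, not the actual GUNet architecture.

```python
import torch
import torch.nn as nn

# Hypothetical 5-joint skeleton (head, neck, l_hand, r_hand, hip) -- not the paper's topology.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]
NUM_JOINTS = 5

def normalized_adjacency(num_joints, edges):
    A = torch.eye(num_joints)                 # self-loops
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = A.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]   # D^-1/2 (A + I) D^-1/2

class SkeletonGCNLayer(nn.Module):
    """One graph-convolution step over per-joint features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.register_buffer("A_hat", normalized_adjacency(NUM_JOINTS, EDGES))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                      # x: (batch, joints, features)
        return torch.relu(self.linear(self.A_hat @ x))

# Noisy per-joint features at some diffusion timestep, e.g. (x, y) plus a positional embedding.
noisy_joints = torch.randn(8, NUM_JOINTS, 16)
out = SkeletonGCNLayer(16, 32)(noisy_joints)
print(out.shape)                               # torch.Size([8, 5, 32])
```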
{"title":"GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation","authors":"Shuowen Liang, Sisi Li, Qingyun Wang, Cen Zhang, Kaiquan Zhu, Tian Yang","doi":"arxiv-2409.11689","DOIUrl":"https://doi.org/arxiv-2409.11689","url":null,"abstract":"Pose skeleton images are an important reference in pose-controllable image\u0000generation. In order to enrich the source of skeleton images, recent works have\u0000investigated the generation of pose skeletons based on natural language. These\u0000methods are based on GANs. However, it remains challenging to perform diverse,\u0000structurally correct and aesthetically pleasing human pose skeleton generation\u0000with various textual inputs. To address this problem, we propose a framework\u0000with GUNet as the main model, PoseDiffusion. It is the first generative\u0000framework based on a diffusion model and also contains a series of variants\u0000fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates\u0000several desired properties that outperform existing methods. 1) Correct\u0000Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to\u0000incorporate graphical convolutional neural networks. It is able to learn the\u0000spatial relationships of the human skeleton by introducing skeletal information\u0000during the training process. 2) Diversity. We decouple the key points of the\u0000skeleton and characterise them separately, and use cross-attention to introduce\u0000textual conditions. Experimental results show that PoseDiffusion outperforms\u0000existing SoTA algorithms in terms of stability and diversity of text-driven\u0000pose skeleton generation. Qualitative analyses further demonstrate its\u0000superiority for controllable generation in Stable Diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Applications of Knowledge Distillation in Remote Sensing: A Survey (arXiv:2409.12111)
Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad
With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. We first introduce the fundamental concepts and historical progression of KD methods. We then highlight the advantages of employing KD, particularly model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, in which each category is critically analyzed to demonstrate the breadth and depth of the alternatives, and presents case studies that showcase practical implementations of KD in RS tasks such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field. Through this organization, the paper not only elucidates the current state of KD research but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
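For readers new to KD, the snippet below sketches the canonical response-based distillation loss: a KL term between temperature-softened teacher and student distributions blended with the usual hard-label loss. The temperature and weighting are illustrative hyperparameters; the RS-specific variants surveyed in the paper go well beyond this baseline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Classic soft-target distillation: KL between temperature-softened
    teacher and student distributions, blended with the hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so gradients match the hard loss
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class scene-classification task.
s = torch.randn(16, 10)
t = torch.randn(16, 10)
y = torch.randint(0, 10, (16,))
print(distillation_loss(s, t, y).item())
```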
{"title":"Applications of Knowledge Distillation in Remote Sensing: A Survey","authors":"Yassine Himeur, Nour Aburaed, Omar Elharrouss, Iraklis Varlamis, Shadi Atalla, Wathiq Mansoor, Hussain Al Ahmad","doi":"arxiv-2409.12111","DOIUrl":"https://doi.org/arxiv-2409.12111","url":null,"abstract":"With the ever-growing complexity of models in the field of remote sensing\u0000(RS), there is an increasing demand for solutions that balance model accuracy\u0000with computational efficiency. Knowledge distillation (KD) has emerged as a\u0000powerful tool to meet this need, enabling the transfer of knowledge from large,\u0000complex models to smaller, more efficient ones without significant loss in\u0000performance. This review article provides an extensive examination of KD and\u0000its innovative applications in RS. KD, a technique developed to transfer\u0000knowledge from a complex, often cumbersome model (teacher) to a more compact\u0000and efficient model (student), has seen significant evolution and application\u0000across various domains. Initially, we introduce the fundamental concepts and\u0000historical progression of KD methods. The advantages of employing KD are\u0000highlighted, particularly in terms of model compression, enhanced computational\u0000efficiency, and improved performance, which are pivotal for practical\u0000deployments in RS scenarios. The article provides a comprehensive taxonomy of\u0000KD techniques, where each category is critically analyzed to demonstrate the\u0000breadth and depth of the alternative options, and illustrates specific case\u0000studies that showcase the practical implementation of KD methods in RS tasks,\u0000such as instance segmentation and object detection. Further, the review\u0000discusses the challenges and limitations of KD in RS, including practical\u0000constraints and prospective future directions, providing a comprehensive\u0000overview for researchers and practitioners in the field of RS. Through this\u0000organization, the paper not only elucidates the current state of research in KD\u0000but also sets the stage for future research opportunities, thereby contributing\u0000significantly to both academic research and real-world applications.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"54 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency (arXiv:2409.12040)
Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen
Remote photoplethysmography (rPPG) is a non-contact method that uses facial video to predict changes in blood volume, enabling the measurement of physiological metrics. Traditional rPPG models often generalize poorly to unseen domains. Current solutions improve generalization in the target domain through Domain Generalization (DG) or Domain Adaptation (DA). However, both require access to source domain data as well as target domain data, which is impractical in scenarios with limited access to source data and raises privacy concerns. In this paper, we propose the first Source-Free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which overcomes these limitations by enabling effective domain adaptation without access to source domain data. Our framework incorporates a Three-Branch Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency across domains. Furthermore, we propose a new rPPG distribution alignment loss based on the Frequency-domain Wasserstein Distance (FWD), which leverages optimal transport to align power spectrum distributions across domains and further enforces alignment of the three branches. Extensive cross-domain experiments and ablation studies demonstrate the effectiveness of our method in source-free domain adaptation settings. Our findings highlight the significant contribution of the proposed FWD loss to distributional alignment, providing a valuable reference for future research and applications. The source code is available at https://github.com/XieYiping66/SFDA-rPPG
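The paper's exact FWD formulation is not reproduced here, but the sketch below shows the general idea under simple assumptions: treat each signal's normalized power spectrum as a 1D distribution and compute the closed-form Wasserstein-1 distance from the difference of their cumulative sums.

```python
import torch

def power_spectrum(signal, fs=30.0):
    """Normalized power spectrum of a 1D rPPG signal sampled at fs Hz."""
    spec = torch.fft.rfft(signal - signal.mean())
    power = spec.abs() ** 2
    freqs = torch.fft.rfftfreq(signal.numel(), d=1.0 / fs)
    return freqs, power / power.sum()            # treat the spectrum as a distribution

def frequency_wasserstein(sig_a, sig_b, fs=30.0):
    """1D Wasserstein-1 distance between two power spectra on the same
    frequency grid, computed from the difference of their CDFs."""
    freqs, p_a = power_spectrum(sig_a, fs)
    _, p_b = power_spectrum(sig_b, fs)
    cdf_diff = torch.cumsum(p_a - p_b, dim=0)
    df = freqs[1] - freqs[0]                      # uniform frequency spacing
    return (cdf_diff.abs() * df).sum()

# Toy example: two noisy pulse-like signals at roughly 1.2 Hz and 1.5 Hz.
t = torch.arange(0, 10, 1 / 30.0)
a = torch.sin(2 * torch.pi * 1.2 * t) + 0.1 * torch.randn_like(t)
b = torch.sin(2 * torch.pi * 1.5 * t) + 0.1 * torch.randn_like(t)
print(frequency_wasserstein(a, b).item())
```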
{"title":"SFDA-rPPG: Source-Free Domain Adaptive Remote Physiological Measurement with Spatio-Temporal Consistency","authors":"Yiping Xie, Zitong Yu, Bingjie Wu, Weicheng Xie, Linlin Shen","doi":"arxiv-2409.12040","DOIUrl":"https://doi.org/arxiv-2409.12040","url":null,"abstract":"Remote Photoplethysmography (rPPG) is a non-contact method that uses facial\u0000video to predict changes in blood volume, enabling physiological metrics\u0000measurement. Traditional rPPG models often struggle with poor generalization\u0000capacity in unseen domains. Current solutions to this problem is to improve its\u0000generalization in the target domain through Domain Generalization (DG) or\u0000Domain Adaptation (DA). However, both traditional methods require access to\u0000both source domain data and target domain data, which cannot be implemented in\u0000scenarios with limited access to source data, and another issue is the privacy\u0000of accessing source domain data. In this paper, we propose the first\u0000Source-free Domain Adaptation benchmark for rPPG measurement (SFDA-rPPG), which\u0000overcomes these limitations by enabling effective domain adaptation without\u0000access to source domain data. Our framework incorporates a Three-Branch\u0000Spatio-Temporal Consistency Network (TSTC-Net) to enhance feature consistency\u0000across domains. Furthermore, we propose a new rPPG distribution alignment loss\u0000based on the Frequency-domain Wasserstein Distance (FWD), which leverages\u0000optimal transport to align power spectrum distributions across domains\u0000effectively and further enforces the alignment of the three branches. Extensive\u0000cross-domain experiments and ablation studies demonstrate the effectiveness of\u0000our proposed method in source-free domain adaptation settings. Our findings\u0000highlight the significant contribution of the proposed FWD loss for\u0000distributional alignment, providing a valuable reference for future research\u0000and applications. The source code is available at\u0000https://github.com/XieYiping66/SFDA-rPPG","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Precise Forecasting of Sky Images Using Spatial Warping (arXiv:2409.12162)
Leron Julian, Aswin C. Sankaranarayanan
The intermittency of solar power, due to occlusion from cloud cover, is one of the key factors inhibiting its widespread use in both commercial and residential settings. Hence, real-time forecasting of solar irradiance for grid-connected photovoltaic systems is necessary to schedule and allocate resources across the grid. Ground-based imagers that capture wide field-of-view (FOV) images of the sky are commonly used to monitor cloud movement around a particular site in an effort to forecast solar irradiance. However, these wide-FOV imagers capture a distorted image of the sky in which regions near the horizon are heavily compressed. This hinders the ability to precisely predict cloud motion near the horizon, which especially affects prediction over longer time horizons. In this work, we address this constraint by introducing a deep learning method that predicts a future sky image frame at higher resolution than previous methods. Our main contributions are to derive an optimal warping method that counters the adverse effects of clouds at the horizon, and to learn a framework for future sky image prediction that better captures cloud evolution over longer time horizons.
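As a toy illustration of horizon-aware warping (not the optimal warp derived in the paper), the sketch below radially re-samples a wide-FOV sky image with a profile whose derivative vanishes at the image edge, so the compressed horizon band is stretched over more output pixels; the sin() profile and the OpenCV remap are assumptions.

```python
import cv2
import numpy as np

def horizon_stretch_warp(sky_img):
    """Radially re-sample a wide-FOV sky image so the compressed region near the
    horizon (image edge) occupies more output pixels. The sin() profile is an
    illustrative choice, not the optimal warp derived in the paper."""
    h, w = sky_img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    radius = min(cx, cy)

    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dx, dy = xs - cx, ys - cy
    r_out = np.sqrt(dx ** 2 + dy ** 2) / radius          # normalized output radius
    theta = np.arctan2(dy, dx)

    # Inverse map (output radius -> input radius): derivative goes to 0 at the edge,
    # so a thin ring of input pixels near the horizon is stretched outward.
    r_in = np.sin(np.clip(r_out, 0, 1) * np.pi / 2) * radius
    map_x = (cx + r_in * np.cos(theta)).astype(np.float32)
    map_y = (cy + r_in * np.sin(theta)).astype(np.float32)
    return cv2.remap(sky_img, map_x, map_y, interpolation=cv2.INTER_LINEAR)

# Toy usage on a synthetic frame; replace with a real all-sky image.
frame = (np.random.rand(480, 480, 3) * 255).astype(np.uint8)
print(horizon_stretch_warp(frame).shape)
```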
{"title":"Precise Forecasting of Sky Images Using Spatial Warping","authors":"Leron Julian, Aswin C. Sankaranarayanan","doi":"arxiv-2409.12162","DOIUrl":"https://doi.org/arxiv-2409.12162","url":null,"abstract":"The intermittency of solar power, due to occlusion from cloud cover, is one\u0000of the key factors inhibiting its widespread use in both commercial and\u0000residential settings. Hence, real-time forecasting of solar irradiance for\u0000grid-connected photovoltaic systems is necessary to schedule and allocate\u0000resources across the grid. Ground-based imagers that capture wide field-of-view\u0000images of the sky are commonly used to monitor cloud movement around a\u0000particular site in an effort to forecast solar irradiance. However, these wide\u0000FOV imagers capture a distorted image of sky image, where regions near the\u0000horizon are heavily compressed. This hinders the ability to precisely predict\u0000cloud motion near the horizon which especially affects prediction over longer\u0000time horizons. In this work, we combat the aforementioned constraint by\u0000introducing a deep learning method to predict a future sky image frame with\u0000higher resolution than previous methods. Our main contribution is to derive an\u0000optimal warping method to counter the adverse affects of clouds at the horizon,\u0000and learn a framework for future sky image prediction which better determines\u0000cloud evolution for longer time horizons.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Encoding for Image Recall: Human-Like Memory (arXiv:2409.11750)
Virgile Foussereau, Robin Dumas
Achieving human-like memory recall in artificial systems remains a challenging frontier in computer vision. Humans demonstrate a remarkable ability to recall images after a single exposure, even after being shown thousands of images. However, this capacity diminishes significantly when confronted with non-natural stimuli such as random textures. In this paper, we present a method inspired by human memory processes to bridge this gap between artificial and biological memory systems. Our approach focuses on encoding images to mimic the high-level information retained by the human brain, rather than storing raw pixel data. By adding noise to images before encoding, we introduce variability akin to the non-deterministic nature of human memory encoding. Leveraging the embedding layers of pre-trained models, we explore how different architectures encode images and how this affects memory recall. Our method achieves strong results, with 97% accuracy on natural images and near-random performance (52%) on textures. We provide insights into the encoding process and its implications for machine learning memory systems, shedding light on the parallels between human and artificial memory mechanisms.
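A minimal sketch of such a recall pipeline, under assumptions the paper may not share (ResNet-18 penultimate features as the stored code, Gaussian noise before encoding, and a cosine-similarity threshold for the recall decision):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Penultimate-layer features of a pre-trained CNN stand in for the "high-level
# code" held in memory; the noise level and threshold are illustrative.
backbone = models.resnet18(weights="IMAGENET1K_V1")
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def encode(image, noise_std=0.1):
    noisy = image + noise_std * torch.randn_like(image)    # stochastic encoding
    return F.normalize(encoder(noisy).flatten(1), dim=1)   # unit-norm embedding

def seen_before(query, memory, threshold=0.9):
    """Recall decision: is any stored code close enough to the query's code?"""
    sims = memory @ encode(query).T                         # cosine similarities
    return bool(sims.max() > threshold)

# Toy usage with random "images" of shape (batch, 3, 224, 224).
studied = torch.rand(5, 3, 224, 224)
memory = torch.cat([encode(img.unsqueeze(0)) for img in studied])
print(seen_before(studied[0].unsqueeze(0), memory))
```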
{"title":"Neural Encoding for Image Recall: Human-Like Memory","authors":"Virgile Foussereau, Robin Dumas","doi":"arxiv-2409.11750","DOIUrl":"https://doi.org/arxiv-2409.11750","url":null,"abstract":"Achieving human-like memory recall in artificial systems remains a\u0000challenging frontier in computer vision. Humans demonstrate remarkable ability\u0000to recall images after a single exposure, even after being shown thousands of\u0000images. However, this capacity diminishes significantly when confronted with\u0000non-natural stimuli such as random textures. In this paper, we present a method\u0000inspired by human memory processes to bridge this gap between artificial and\u0000biological memory systems. Our approach focuses on encoding images to mimic the\u0000high-level information retained by the human brain, rather than storing raw\u0000pixel data. By adding noise to images before encoding, we introduce variability\u0000akin to the non-deterministic nature of human memory encoding. Leveraging\u0000pre-trained models' embedding layers, we explore how different architectures\u0000encode images and their impact on memory recall. Our method achieves impressive\u0000results, with 97% accuracy on natural images and near-random performance (52%)\u0000on textures. We provide insights into the encoding process and its implications\u0000for machine learning memory systems, shedding light on the parallels between\u0000human and artificial intelligence memory mechanisms.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SymFace: Additional Facial Symmetry Loss for Deep Face Recognition (arXiv:2409.11816)
Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran
Over the past decade, there has been steady progress in improving face recognition algorithms with advanced machine learning methods. The loss function plays a pivotal, often game-changing role in face verification. Existing loss functions have mainly explored variations in intra-class or inter-class separation. This research examines the natural phenomenon of facial symmetry in the face verification problem. The symmetry between the left and right hemi-faces has been widely used in many research areas in recent decades. This paper adopts this simple idea judiciously by splitting the face image vertically into two halves. Assuming that facial symmetry can enhance face verification, we hypothesize that the two output embedding vectors of the split faces must project close to each other in the output embedding space. Inspired by this concept, we penalize the network based on the disparity between the embeddings of the symmetrical pair of split faces. The symmetry loss can suppress minor asymmetric features caused by facial expression and lighting conditions, thereby significantly increasing the inter-class variance among the classes and leading to more reliable face embeddings. This loss function propels any network to outperform its baseline across existing network architectures and configurations, enabling us to achieve SoTA results.
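A minimal sketch of such a symmetry penalty, assuming the right half is mirrored before encoding and cosine distance is used; the stand-in encoder and the 0.1 weight are illustrative, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def symmetry_loss(face_batch, encoder):
    """Penalize the distance between embeddings of the two vertical halves of a
    face; mirroring the right half and using cosine distance are assumptions."""
    w = face_batch.shape[-1]
    left = face_batch[..., : w // 2]
    right = torch.flip(face_batch[..., w - w // 2 :], dims=[-1])   # mirror to align
    z_left = F.normalize(encoder(left), dim=1)
    z_right = F.normalize(encoder(right), dim=1)
    return (1.0 - (z_left * z_right).sum(dim=1)).mean()            # 1 - cosine

# Toy usage: a tiny stand-in encoder and a random batch of 112x112 face crops.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.LazyLinear(128))
faces = torch.rand(4, 3, 112, 112)
base_loss = torch.tensor(0.0)                  # placeholder for the usual margin-based loss
total_loss = base_loss + 0.1 * symmetry_loss(faces, encoder)
print(total_loss.item())
```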
{"title":"SymFace: Additional Facial Symmetry Loss for Deep Face Recognition","authors":"Pritesh Prakash, Koteswar Rao Jerripothula, Ashish Jacob Sam, Prinsh Kumar Singh, S Umamaheswaran","doi":"arxiv-2409.11816","DOIUrl":"https://doi.org/arxiv-2409.11816","url":null,"abstract":"Over the past decade, there has been a steady advancement in enhancing face\u0000recognition algorithms leveraging advanced machine learning methods. The role\u0000of the loss function is pivotal in addressing face verification problems and\u0000playing a game-changing role. These loss functions have mainly explored\u0000variations among intra-class or inter-class separation. This research examines\u0000the natural phenomenon of facial symmetry in the face verification problem. The\u0000symmetry between the left and right hemi faces has been widely used in many\u0000research areas in recent decades. This paper adopts this simple approach\u0000judiciously by splitting the face image vertically into two halves. With the\u0000assumption that the natural phenomena of facial symmetry can enhance face\u0000verification methodology, we hypothesize that the two output embedding vectors\u0000of split faces must project close to each other in the output embedding space.\u0000Inspired by this concept, we penalize the network based on the disparity of\u0000embedding of the symmetrical pair of split faces. Symmetrical loss has the\u0000potential to minimize minor asymmetric features due to facial expression and\u0000lightning conditions, hence significantly increasing the inter-class variance\u0000among the classes and leading to more reliable face embedding. This loss\u0000function propels any network to outperform its baseline performance across all\u0000existing network architectures and configurations, enabling us to achieve SoTA\u0000results.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (arXiv:2409.11969)
Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li
End-to-end models are emerging as the mainstream in autonomous driving perception. However, the inability to meticulously deconstruct their internal mechanisms reduces development efficacy and impedes the establishment of trust. As a first attempt at this issue, we present the Independent Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a novel framework that compares a module's feature maps against the ground truth within a unified semantic representation space to quantify their similarity, thereby assessing the training maturity of individual functional modules. The core of the framework is the process of feature-map encoding and representation alignment, carried out by our proposed two-stage Alignment AutoEncoder, which preserves salient information and keeps the feature structure consistent. The metric for evaluating the training maturity of functional modules, the Similarity Score, shows a robust positive correlation with BEV metrics, with an average correlation coefficient of 0.9387, attesting to the framework's reliability for assessment purposes.
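A toy version of the similarity computation, with simple convolutional encoders standing in for the two-stage Alignment AutoEncoder and cosine similarity as the score; the layer sizes and inputs are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentEncoder(nn.Module):
    """Toy encoder mapping a BEV feature map (or a rasterized ground truth) into a
    shared representation space; the real framework uses a two-stage alignment
    autoencoder, so the layers here are illustrative."""
    def __init__(self, in_ch, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

def similarity_score(module_feat, gt_raster, enc_feat, enc_gt):
    """Cosine similarity between aligned representations; higher is taken to
    indicate a more mature functional module."""
    return (enc_feat(module_feat) * enc_gt(gt_raster)).sum(dim=1).mean()

# Toy usage: a 64-channel module feature map vs. a 1-channel GT occupancy raster.
score = similarity_score(torch.rand(2, 64, 100, 100), torch.rand(2, 1, 100, 100),
                         AlignmentEncoder(64), AlignmentEncoder(1))
print(score.item())
```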
{"title":"Unveiling the Black Box: Independent Functional Module Evaluation for Bird's-Eye-View Perception Model","authors":"Ludan Zhang, Xiaokang Ding, Yuqi Dai, Lei He, Keqiang Li","doi":"arxiv-2409.11969","DOIUrl":"https://doi.org/arxiv-2409.11969","url":null,"abstract":"End-to-end models are emerging as the mainstream in autonomous driving\u0000perception. However, the inability to meticulously deconstruct their internal\u0000mechanisms results in diminished development efficacy and impedes the\u0000establishment of trust. Pioneering in the issue, we present the Independent\u0000Functional Module Evaluation for Bird's-Eye-View Perception Model (BEV-IFME), a\u0000novel framework that juxtaposes the module's feature maps against Ground Truth\u0000within a unified semantic Representation Space to quantify their similarity,\u0000thereby assessing the training maturity of individual functional modules. The\u0000core of the framework lies in the process of feature map encoding and\u0000representation aligning, facilitated by our proposed two-stage Alignment\u0000AutoEncoder, which ensures the preservation of salient information and the\u0000consistency of feature structure. The metric for evaluating the training\u0000maturity of functional modules, Similarity Score, demonstrates a robust\u0000positive correlation with BEV metrics, with an average correlation coefficient\u0000of 0.9387, attesting to the framework's reliability for assessment purposes.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning (arXiv:2409.11813)
Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang
The event camera has achieved significant success across a wide range of areas due to its low latency and high dynamic range. However, the community faces challenges such as data deficiency and limited diversity, often resulting in overfitting and inadequate feature learning. Notably, the exploration of data augmentation techniques in the event community remains scarce. This work addresses this gap by introducing a systematic augmentation scheme, EventAug, to enrich spatio-temporal diversity. In particular, we first propose Multi-scale Temporal Integration (MSTI) to diversify the motion speed of objects, then introduce the Spatial-salient Event Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants. EventAug lets models learn from richer motion patterns, object variants, and local spatio-temporal relations, improving robustness to varied moving speeds, occlusions, and action disruptions. Experimental results show that our augmentation method consistently yields significant improvements across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128 Gesture). Our code will be made publicly available to the community.
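The sketch below shows simplified stand-ins for these ideas on a raw (x, y, t, polarity) event stream: timestamp rescaling to vary apparent motion speed, and a random spatio-temporal mask that drops events inside a window. The actual MSTI/SSEM/TSEM operators in the paper are saliency-driven and more involved.

```python
import numpy as np

def temporal_scale(events, factor):
    """Speed up or slow down apparent motion by rescaling timestamps
    (a simplified stand-in for multi-scale temporal integration)."""
    out = events.copy()
    out[:, 2] = (out[:, 2] - out[:, 2].min()) * factor
    return out

def random_spatiotemporal_mask(events, sensor_hw, rng=None):
    """Drop events inside a random spatial window over a random time slice
    (a simplified stand-in for the spatial/temporal salient event masks)."""
    rng = rng or np.random.default_rng()
    h, w = sensor_hw
    mh, mw = int(h * 0.3), int(w * 0.3)
    y0, x0 = rng.integers(0, h - mh), rng.integers(0, w - mw)
    t_lo, t_hi = np.quantile(events[:, 2], sorted(rng.uniform(0, 1, 2)))
    inside = ((events[:, 0] >= x0) & (events[:, 0] < x0 + mw) &
              (events[:, 1] >= y0) & (events[:, 1] < y0 + mh) &
              (events[:, 2] >= t_lo) & (events[:, 2] <= t_hi))
    return events[~inside]

# Toy event stream: columns are (x, y, t, polarity) on a 128x128 sensor.
ev = np.column_stack([np.random.randint(0, 128, 10000),
                      np.random.randint(0, 128, 10000),
                      np.sort(np.random.rand(10000)),
                      np.random.choice([-1, 1], 10000)]).astype(np.float32)
aug = random_spatiotemporal_mask(temporal_scale(ev, 0.5), (128, 128))
print(ev.shape, aug.shape)
```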
{"title":"EventAug: Multifaceted Spatio-Temporal Data Augmentation Methods for Event-based Learning","authors":"Yukun Tian, Hao Chen, Yongjian Deng, Feihong Shen, Kepan Liu, Wei You, Ziyang Zhang","doi":"arxiv-2409.11813","DOIUrl":"https://doi.org/arxiv-2409.11813","url":null,"abstract":"The event camera has demonstrated significant success across a wide range of\u0000areas due to its low time latency and high dynamic range. However, the\u0000community faces challenges such as data deficiency and limited diversity, often\u0000resulting in over-fitting and inadequate feature learning. Notably, the\u0000exploration of data augmentation techniques in the event community remains\u0000scarce. This work aims to address this gap by introducing a systematic\u0000augmentation scheme named EventAug to enrich spatial-temporal diversity. In\u0000particular, we first propose Multi-scale Temporal Integration (MSTI) to\u0000diversify the motion speed of objects, then introduce Spatial-salient Event\u0000Mask (SSEM) and Temporal-salient Event Mask (TSEM) to enrich object variants.\u0000Our EventAug can facilitate models learning with richer motion patterns, object\u0000variants and local spatio-temporal relations, thus improving model robustness\u0000to varied moving speeds, occlusions, and action disruptions. Experiment results\u0000show that our augmentation method consistently yields significant improvements\u0000across different tasks and backbones (e.g., a 4.87% accuracy gain on DVS128\u0000Gesture). Our code will be publicly available for this community.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (arXiv:2409.12191)
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
We present the Qwen2-VL series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude 3.5 Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
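To make the dynamic-resolution idea concrete, the sketch below estimates the visual-token budget for images of different sizes, assuming patch-based tokenization followed by a 2x2 token merge; the patch size and merge factor are illustrative assumptions rather than the model's exact configuration.

```python
import math

def visual_token_count(height, width, patch=14, merge=2):
    """Rough token budget under a dynamic-resolution scheme: the image is cut
    into patch x patch tiles and adjacent merge x merge tokens are fused.
    Patch size and merge factor are illustrative assumptions."""
    h_patches = math.ceil(height / patch)
    w_patches = math.ceil(width / patch)
    return math.ceil(h_patches / merge) * math.ceil(w_patches / merge)

# The same model spends very different budgets on different inputs.
for h, w in [(224, 224), (448, 896), (1080, 1920)]:
    print(f"{h}x{w} -> {visual_token_count(h, w)} visual tokens")
```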
{"title":"Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution","authors":"Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin","doi":"arxiv-2409.12191","DOIUrl":"https://doi.org/arxiv-2409.12191","url":null,"abstract":"We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL\u0000models that redefines the conventional predetermined-resolution approach in\u0000visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism,\u0000which enables the model to dynamically process images of varying resolutions\u0000into different numbers of visual tokens. This approach allows the model to\u0000generate more efficient and accurate visual representations, closely aligning\u0000with human perceptual processes. The model also integrates Multimodal Rotary\u0000Position Embedding (M-RoPE), facilitating the effective fusion of positional\u0000information across text, images, and videos. We employ a unified paradigm for\u0000processing both images and videos, enhancing the model's visual perception\u0000capabilities. To explore the potential of large multimodal models, Qwen2-VL\u0000investigates the scaling laws for large vision-language models (LVLMs). By\u0000scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the\u0000amount of training data, the Qwen2-VL Series achieves highly competitive\u0000performance. Notably, the Qwen2-VL-72B model achieves results comparable to\u0000leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal\u0000benchmarks, outperforming other generalist models. Code is available at\u0000url{https://github.com/QwenLM/Qwen2-VL}.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intraoperative Registration by Cross-Modal Inverse Neural Rendering (arXiv:2409.11983)
Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine
In this paper, we present a novel approach for 3D/2D intraoperative registration during neurosurgery via cross-modal inverse neural rendering. Our approach separates the implicit neural representation into two components, handling anatomical structure preoperatively and appearance intraoperatively. This disentanglement is achieved by controlling a Neural Radiance Field's appearance with a multi-style hypernetwork. Once trained, the implicit neural representation serves as a differentiable rendering engine that can be used to estimate the surgical camera pose by minimizing the dissimilarity between its rendered images and the target intraoperative image. We tested our method on retrospective patient data from clinical cases, showing that it outperforms the state of the art while meeting current clinical standards for registration. Code and additional resources can be found at https://maxfehrentz.github.io/style-ngp/.
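A generic sketch of pose estimation by inverse rendering, with a hypothetical differentiable render(pose) standing in for the trained implicit representation and an L1 photometric dissimilarity as an illustrative choice:

```python
import torch

def register_camera(render, target, pose_init, steps=200, lr=1e-2):
    """Generic pose-by-inverse-rendering loop: `render` is assumed to be a
    differentiable function mapping a 6-DoF pose vector to an image; the L1
    photometric dissimilarity is an illustrative choice."""
    pose = pose_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (render(pose) - target).abs().mean()
        loss.backward()
        opt.step()
    return pose.detach()

# Toy usage with a fake differentiable "renderer" that modulates a gradient image.
target_pose = torch.tensor([0.3, -0.2, 0.1, 0.0, 0.0, 0.0])
base = torch.linspace(0, 1, 64).repeat(64, 1)
fake_render = lambda p: base * (1 + p[:3].sum()) + p[3:].sum()
target_img = fake_render(target_pose)
est = register_camera(fake_render, target_img, torch.zeros(6))
print(f"final photometric error: {(fake_render(est) - target_img).abs().mean().item():.4f}")
```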
{"title":"Intraoperative Registration by Cross-Modal Inverse Neural Rendering","authors":"Maximilian Fehrentz, Mohammad Farid Azampour, Reuben Dorent, Hassan Rasheed, Colin Galvin, Alexandra Golby, William M. Wells, Sarah Frisken, Nassir Navab, Nazim Haouchine","doi":"arxiv-2409.11983","DOIUrl":"https://doi.org/arxiv-2409.11983","url":null,"abstract":"We present in this paper a novel approach for 3D/2D intraoperative\u0000registration during neurosurgery via cross-modal inverse neural rendering. Our\u0000approach separates implicit neural representation into two components, handling\u0000anatomical structure preoperatively and appearance intraoperatively. This\u0000disentanglement is achieved by controlling a Neural Radiance Field's appearance\u0000with a multi-style hypernetwork. Once trained, the implicit neural\u0000representation serves as a differentiable rendering engine, which can be used\u0000to estimate the surgical camera pose by minimizing the dissimilarity between\u0000its rendered images and the target intraoperative image. We tested our method\u0000on retrospective patients' data from clinical cases, showing that our method\u0000outperforms state-of-the-art while meeting current clinical standards for\u0000registration. Code and additional resources can be found at\u0000https://maxfehrentz.github.io/style-ngp/.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142250562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}