
Latest Articles from International Journal of Computer Vision

A Lightweight Hybrid Gabor Deep Learning Approach and its Application to Medical Image Classification
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-05 · DOI: 10.1007/s11263-025-02658-2
Rayyan Ahmed, Hamza Baali, Abdesselam Bouzerdoum
Deep learning has revolutionized image analysis, but its applications are limited by the need for large datasets and high computational resources. Hybrid approaches that combine a domain-specific, universal feature extractor with learnable neural networks offer a promising balance of efficiency and accuracy. This paper presents a hybrid model integrating a Gabor filter bank front-end with compact neural networks for efficient feature extraction and classification. Gabor filters, inherently bandpass, extract early-stage features with spatially shifted filters covering the frequency plane to balance spatial and spectral localization. We introduce separate channels capturing low- and high-frequency components to enhance feature representation while maintaining efficiency. The approach reduces trainable parameters and training time while preserving accuracy, making it suitable for resource-constrained environments. Compared to MobileNetV2 and EfficientNetB0, our model trains approximately 4–6× faster on average while using fewer parameters and FLOPs. We compare it to pretrained networks used as feature extractors, lightweight fine-tuned models, and classical descriptors (HOG, LBP). It achieves competitive results with faster training and reduced computation. The hybrid model uses only around 0.60 GFLOPs and 0.34 M parameters, and we apply statistical significance testing (ANOVA, paired t-tests) to validate performance gains. Inference takes 0.01–0.02 s per image, up to 15× faster than EfficientNetB0 and 8× faster than MobileNetV2. Grad-CAM visualizations confirm localized attention on relevant regions. This work highlights the value of integrating traditional features with deep learning to improve efficiency for resource-limited applications. Future work will address color fusion, robustness to noise, and automated filter optimization.
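The core of the approach is a fixed, non-trainable Gabor filter bank acting as the front-end for a compact trainable classifier. The following is a minimal PyTorch sketch of that idea; the kernel size, the two frequency bands, the orientation count, and the tiny classifier head are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' code): a frozen Gabor bank feeding a small trainable head.
import math
import torch
import torch.nn as nn

def gabor_kernel(ksize: int, sigma: float, theta: float, freq: float) -> torch.Tensor:
    """Real part of a 2-D Gabor filter with orientation theta and spatial frequency freq."""
    half = ksize // 2
    ys, xs = torch.meshgrid(
        torch.arange(-half, half + 1, dtype=torch.float32),
        torch.arange(-half, half + 1, dtype=torch.float32),
        indexing="ij",
    )
    x_t = xs * math.cos(theta) + ys * math.sin(theta)
    y_t = -xs * math.sin(theta) + ys * math.cos(theta)
    envelope = torch.exp(-(x_t ** 2 + y_t ** 2) / (2 * sigma ** 2))
    carrier = torch.cos(2 * math.pi * freq * x_t)
    return envelope * carrier

class GaborFrontEnd(nn.Module):
    """Frozen Gabor bank split into low- and high-frequency channel groups (assumed parameters)."""
    def __init__(self, ksize=11, n_orient=4, low_freq=0.1, high_freq=0.3):
        super().__init__()
        kernels = []
        for freq in (low_freq, high_freq):           # separate low- and high-frequency bands
            for k in range(n_orient):
                theta = k * math.pi / n_orient
                kernels.append(gabor_kernel(ksize, sigma=ksize / 4, theta=theta, freq=freq))
        weight = torch.stack(kernels).unsqueeze(1)    # (2 * n_orient, 1, ksize, ksize)
        self.conv = nn.Conv2d(1, weight.shape[0], ksize, padding=ksize // 2, bias=False)
        self.conv.weight.data.copy_(weight)
        self.conv.weight.requires_grad_(False)        # filters stay fixed; nothing to train here

    def forward(self, x):                             # x: (B, 1, H, W) grayscale image batch
        return torch.relu(self.conv(x))

# Compact trainable classifier head on top of the fixed Gabor features (illustrative sizes).
model = nn.Sequential(
    GaborFrontEnd(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),                                 # e.g. a binary medical classification task
)
logits = model(torch.randn(4, 1, 224, 224))
```

Because the Gabor weights are frozen, only the small head contributes trainable parameters, which is the source of the parameter and training-time savings the abstract reports.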
Citations: 0
Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-05 · DOI: 10.1007/s11263-025-02669-z
Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Wangmeng Zuo
{"title":"Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration","authors":"Gang Wu, Junjun Jiang, Kui Jiang, Xianming Liu, Wangmeng Zuo","doi":"10.1007/s11263-025-02669-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02669-z","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"1 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145902472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Multi-Modal Knowledge-Driven Approach for Generalized Zero-shot Video Classification
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-04 · DOI: 10.1007/s11263-025-02584-3
Mingyao Hong, Xinfeng Zhang, Guorong Li, Qingming Huang
{"title":"A Multi-Modal Knowledge-Driven Approach for Generalized Zero-shot Video Classification","authors":"Mingyao Hong, Xinfeng Zhang, Guorong Li, Qingming Huang","doi":"10.1007/s11263-025-02584-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02584-3","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CrowdMoGen: Event-Driven Collective Human Motion Generation
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-04 · DOI: 10.1007/s11263-025-02677-z
Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, Ziwei Liu
{"title":"CrowdMoGen: Event-Driven Collective Human Motion Generation","authors":"Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, Ziwei Liu","doi":"10.1007/s11263-025-02677-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02677-z","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"53 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
COBRA: A Continual Learning Approach to Vision-Brain Understanding
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-04 · DOI: 10.1007/s11263-025-02617-x
Xuan-Bac Nguyen, Manuel Serna-Aguilera, Arabinda Kumar Choudhary, Pawan Sinha, Xin Li, Khoa Luu
{"title":"COBRA: A Continual Learning Approach to Vision-Brain Understanding","authors":"Xuan-Bac Nguyen, Manuel Serna-Aguilera, Arabinda Kumar Choudhary, Pawan Sinha, Xin Li, Khoa Luu","doi":"10.1007/s11263-025-02617-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02617-x","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"29 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Multi-Granularity Scene-Aware Graph Convolution Method for Weakly Supervised Person Search
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-03 · DOI: 10.1007/s11263-025-02665-3
De Cheng, Haichun Tai, Nannan Wang, Xiangqian Zhao, Jie Li, Xinbo Gao
{"title":"A Multi-Granularity Scene-Aware Graph Convolution Method for Weakly Supervised Person Search","authors":"De Cheng, Haichun Tai, Nannan Wang, Xiangqian Zhao, Jie Li, Xinbo Gao","doi":"10.1007/s11263-025-02665-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02665-3","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-01 · DOI: 10.1007/s11263-025-02644-8
Jiahang Zhang, Lilang Lin, Shuai Yang, Jiaying Liu
Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for skeleton-based action understanding. Different from the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, with the absence of background clues and the additional temporal dimension, presenting new challenges for spatial-temporal motion pretext task design. Recently, many endeavors have been made for skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on future possible directions. Remarkably, our investigation demonstrates that most SSL works rely on a single paradigm, learning representations of a single level, and are evaluated on the action recognition task solely, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeletons is further proposed, which integrates versatile representation learning objectives of different granularity, substantially boosting the generalization capacity for multiple skeleton downstream tasks. Extensive experiments under three large-scale datasets demonstrate our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.
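The proposed method integrates representation learning objectives of different granularity. As a rough illustration of that idea only (not the paper's implementation), the sketch below sums a standard InfoNCE contrastive loss computed at a sequence level and at a joint level; the random embeddings, dimensions, and 0.5 loss weight are placeholders.

```python
# Minimal sketch of a multi-granularity contrastive objective (assumed form).
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Standard InfoNCE between two batches of embeddings (B, D); matching rows are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                  # (B, B) cosine-similarity logits
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Hypothetical encoder outputs for two augmented views of the same skeleton clips:
# sequence-level embeddings and pooled joint-level embeddings, both (B, D).
B, D = 32, 128
seq_v1, seq_v2 = torch.randn(B, D), torch.randn(B, D)
joint_v1, joint_v2 = torch.randn(B, D), torch.randn(B, D)

# Multi-granularity objective: a weighted sum of per-level contrastive losses.
loss = info_nce(seq_v1, seq_v2) + 0.5 * info_nce(joint_v1, joint_v2)
print(float(loss))
```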
Citations: 0
UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-01 · DOI: 10.1007/s11263-025-02606-0
Ching-Yi Lai, Chih-Yu Jian, Pei-Cheng Chuang, Chia-Ming Lee, Chih-Chung Hsu, Chiou-Ting Hsu, Chia-Wen Lin
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.
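The affinity-driven semantic alignment (ASA) step models inter-modal relationships through affinity matrices and enforces their consistency. The sketch below illustrates only that consistency objective under stated assumptions: the batch features are random placeholders for the three generated modalities, and the pairwise MSE between affinity matrices is an assumed form of the consistency loss rather than the authors' exact formulation.

```python
# Minimal sketch of affinity-matrix consistency across modalities (assumed loss form).
import torch
import torch.nn.functional as F

def affinity(feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity matrix (B, B) for a batch of features (B, D)."""
    f = F.normalize(feats, dim=-1)
    return f @ f.t()

# Hypothetical batch features for the three modalities generated from the visual input.
B, D = 16, 256
rppg_feat = torch.randn(B, D, requires_grad=True)      # compression-robust rPPG signals
landmark_feat = torch.randn(B, D, requires_grad=True)  # temporal landmark dynamics
semantic_feat = torch.randn(B, D, requires_grad=True)  # vision-language semantic embeddings

# Consistency loss: every pair of modality affinity matrices should agree.
mats = [affinity(f) for f in (rppg_feat, landmark_feat, semantic_feat)]
align_loss = sum(F.mse_loss(a, b) for i, a in enumerate(mats) for b in mats[i + 1:])
align_loss.backward()
```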
Citations: 0
Multiple Instance Learning Framework with Masked Hard Instance Mining for Gigapixel Histopathology Image Analysis
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-01 · DOI: 10.1007/s11263-025-02587-0
Wenhao Tang, Sheng Huang, Heng Fang, Fengtao Zhou, Bo Liu, Qingshan Liu
Digitizing pathological images into gigapixel Whole Slide Images (WSIs) has opened new avenues for Computational Pathology (CPath). As positive tissue comprises only a small fraction of gigapixel WSIs, existing Multiple Instance Learning (MIL) methods typically focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting challenging ones. Recent studies have shown that hard examples are crucial for accurately modeling discriminative boundaries. Applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which utilizes a Siamese structure with a consistency constraint to explore the hard instances. Using a class-aware instance probability, MHIM-MIL employs a momentum teacher to mask salient instances and implicitly mine hard instances for training the student model. To obtain diverse, non-redundant hard instances, we adopt large-scale random masking while utilizing a global recycle network to mitigate the risk of losing key features. Furthermore, the student updates the teacher using an exponential moving average, which identifies new hard instances for subsequent training iterations and stabilizes optimization. Experimental results on cancer diagnosis, subtyping, survival analysis tasks, and 12 benchmarks demonstrate that MHIM-MIL outperforms the latest methods in both performance and efficiency. The code is available at: https://github.com/DearCaat/MHIM-MIL.
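The training loop pairs a momentum teacher, which scores instance saliency, with a student that is trained after the most salient instances are masked out; the teacher is then updated as an exponential moving average (EMA) of the student. The following is a minimal, self-contained sketch of that loop under assumed components (a tiny attention-MIL head, a fixed mask ratio, no optimizer, and no class-aware instance probability or global recycle network); the linked repository contains the actual implementation.

```python
# Minimal sketch of masked hard instance mining with a momentum teacher (assumed components).
import copy
import torch
import torch.nn as nn

class AttnMIL(nn.Module):
    """Tiny attention-MIL head: instance features (N, D) -> bag logit and per-instance scores."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.cls = nn.Linear(dim, 1)

    def forward(self, inst):
        scores = self.attn(inst).squeeze(-1)            # (N,) saliency scores
        weights = torch.softmax(scores, dim=0)
        bag = (weights.unsqueeze(-1) * inst).sum(0)     # attention-pooled bag feature
        return self.cls(bag), scores

student = AttnMIL()
teacher = copy.deepcopy(student)                        # momentum teacher starts as a copy
for p in teacher.parameters():
    p.requires_grad_(False)

def train_step(inst, label, mask_ratio=0.3, ema=0.999):
    with torch.no_grad():
        _, t_scores = teacher(inst)                     # teacher saliency per instance
    n_mask = int(mask_ratio * inst.size(0))
    keep = torch.argsort(t_scores, descending=True)[n_mask:]   # mask the most salient instances
    logit, _ = student(inst[keep])                      # student trains on the harder remainder
    loss = nn.functional.binary_cross_entropy_with_logits(logit.squeeze(), label)
    loss.backward()
    with torch.no_grad():                               # EMA update of the teacher
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(ema).add_(sp, alpha=1 - ema)
    return loss.item()

step_loss = train_step(torch.randn(500, 256), torch.tensor(1.0))
```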
Citations: 0
RDG-GS: Relative Depth Guidance with Gaussian Splatting for Real-time Sparse-View 3D Rendering
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2026-01-01 · DOI: 10.1007/s11263-025-02594-1
Chenlu Zhan, Yufei Zhang, Yu Lin, Gaoang Wang, Hongwei Wang
Efficiently synthesizing novel views from sparse inputs while maintaining accuracy remains a critical challenge in 3D reconstruction. While advanced techniques like radiance fields and 3D Gaussian Splatting achieve high rendering quality and impressive efficiency with dense view inputs, they suffer from significant geometric reconstruction errors when applied to sparse input views. Moreover, although recent methods leverage monocular depth estimation to enhance geometric learning, their dependence on single-view estimated depth often leads to view inconsistency issues across different viewpoints. Consequently, this reliance on absolute depth can introduce inaccuracies in geometric information, ultimately compromising the quality of scene reconstruction with Gaussian splats. In this paper, we present RDG-GS, a novel sparse-view 3D rendering framework with Relative Depth Guidance based on 3D Gaussian Splatting. The core innovation lies in utilizing relative depth guidance to refine the Gaussian field, steering it towards view-consistent spatial geometric representations, thereby enabling the reconstruction of accurate geometric structures and capturing intricate textures. First, we devise refined depth priors to rectify the coarse estimated depth and insert global and fine-grained scene information into regular Gaussians. Building on this, to address spatial geometric inaccuracies from absolute depth, we propose relative depth guidance by optimizing the similarity between spatially correlated patches of depth and images. Additionally, we directly address sparse areas that are difficult to converge through adaptive sampling for quick densification. Across extensive experiments on Mip-NeRF360, LLFF, DTU, and Blender, RDG-GS demonstrates state-of-the-art rendering quality and efficiency, making a significant advancement for real-world applications.
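Relative depth guidance supervises the rendered depth with ordering information from a monocular prior instead of absolute, scale-ambiguous values. The sketch below shows one plausible form of such a loss, a pairwise ranking penalty over randomly sampled pixel pairs; the sampling scheme, margin, and pixel-pair formulation are assumptions and stand in for the patch-similarity objective the paper actually optimizes.

```python
# Minimal sketch of a relative-depth ranking loss (assumed form, not the RDG-GS objective).
import torch

def relative_depth_loss(rendered: torch.Tensor, prior: torch.Tensor,
                        n_pairs: int = 4096, margin: float = 1e-4) -> torch.Tensor:
    """rendered, prior: (H, W) depth maps; returns a scalar pairwise ranking loss."""
    h, w = rendered.shape
    idx_a = torch.randint(0, h * w, (n_pairs,))
    idx_b = torch.randint(0, h * w, (n_pairs,))
    r, p = rendered.flatten(), prior.flatten()
    order = torch.sign(p[idx_a] - p[idx_b])          # depth ordering implied by the prior
    diff = r[idx_a] - r[idx_b]
    # Hinge-style penalty when the rendered depth violates that ordering.
    return torch.clamp(margin - order * diff, min=0).mean()

# Hypothetical tensors: depth rendered from the Gaussian field (with gradients)
# and a relative depth prior from a monocular estimator (no gradients).
rendered_depth = torch.rand(256, 256, requires_grad=True)
prior_depth = torch.rand(256, 256)
loss = relative_depth_loss(rendered_depth, prior_depth)
loss.backward()
```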
Citations: 0