
Latest Publications from the International Journal of Computer Vision

A Multi-Modal Knowledge-Driven Approach for Generalized Zero-shot Video Classification
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-04 DOI: 10.1007/s11263-025-02584-3
Mingyao Hong, Xinfeng Zhang, Guorong Li, Qingming Huang
{"title":"A Multi-Modal Knowledge-Driven Approach for Generalized Zero-shot Video Classification","authors":"Mingyao Hong, Xinfeng Zhang, Guorong Li, Qingming Huang","doi":"10.1007/s11263-025-02584-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02584-3","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"33 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CrowdMoGen: Event-Driven Collective Human Motion Generation
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-04 DOI: 10.1007/s11263-025-02677-z
Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, Ziwei Liu
{"title":"CrowdMoGen: Event-Driven Collective Human Motion Generation","authors":"Yukang Cao, Xinying Guo, Mingyuan Zhang, Haozhe Xie, Chenyang Gu, Ziwei Liu","doi":"10.1007/s11263-025-02677-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02677-z","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"53 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
COBRA: A Continual Learning Approach to Vision-Brain Understanding
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-04 DOI: 10.1007/s11263-025-02617-x
Xuan-Bac Nguyen, Manuel Serna-Aguilera, Arabinda Kumar Choudhary, Pawan Sinha, Xin Li, Khoa Luu
{"title":"COBRA: A Continual Learning Approach to Vision-Brain Understanding","authors":"Xuan-Bac Nguyen, Manuel Serna-Aguilera, Arabinda Kumar Choudhary, Pawan Sinha, Xin Li, Khoa Luu","doi":"10.1007/s11263-025-02617-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02617-x","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"29 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Multi-Granularity Scene-Aware Graph Convolution Method for Weakly Supervised Person Search
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-03 DOI: 10.1007/s11263-025-02665-3
De Cheng, Haichun Tai, Nannan Wang, Xiangqian Zhao, Jie Li, Xinbo Gao
{"title":"A Multi-Granularity Scene-Aware Graph Convolution Method for Weakly Supervised Person Search","authors":"De Cheng, Haichun Tai, Nannan Wang, Xiangqian Zhao, Jie Li, Xinbo Gao","doi":"10.1007/s11263-025-02665-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02665-3","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"22 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145894691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 DOI: 10.1007/s11263-025-02644-8
Jiahang Zhang, Lilang Lin, Shuai Yang, Jiaying Liu
Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for skeleton-based action understanding. Unlike the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, with the absence of background clues and the additional temporal dimension, presenting new challenges for spatial-temporal motion pretext task design. Recently, many endeavors have been made in skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on possible future directions. Remarkably, our investigation demonstrates that most SSL works rely on a single paradigm, learn representations at a single level, and are evaluated solely on the action recognition task, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeletons is further proposed, which integrates versatile representation learning objectives of different granularity, substantially boosting the generalization capacity for multiple skeleton downstream tasks. Extensive experiments on three large-scale datasets demonstrate that our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.
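The contrastive-learning branch of the taxonomy above can be made concrete with a minimal sketch. The code below is a generic symmetric InfoNCE objective between two augmented views of the same skeleton clips, not the multi-granularity method proposed in the paper; the encoder and augmentation pipeline referenced in the usage comment are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Symmetric InfoNCE loss between two augmented views of the same clips.

    z1, z2: (N, D) embeddings of N skeleton clips under two augmentations
    (e.g., rotation, shear, temporal crop). Row i of z1 and row i of z2 form
    the positive pair; every other row in the batch is a negative.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with a placeholder spatio-temporal encoder (any skeleton GCN would do):
# z1, z2 = encoder(augment(x)), encoder(augment(x))
# loss = info_nce(z1, z2)
```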
Citations: 0
UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 DOI: 10.1007/s11263-025-02606-0
Ching-Yi Lai, Chih-Yu Jian, Pei-Cheng Chuang, Chia-Ming Lee, Chih-Chung Hsu, Chiou-Ting Hsu, Chia-Wen Lin
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. Although existing methods have progressed from single-modal to multimodal approaches, they face critical limitations: single-modal methods struggle with feature degradation under data compression in social media streaming, while multimodal approaches require expensive data collection and labeling and suffer from inconsistent modal quality or accessibility in real-world scenarios. To address these challenges, we propose a novel Unimodal-generated Multimodal Contrastive Learning (UMCL) framework for robust cross-compression-rate (CCR) deepfake detection. In the training stage, our approach transforms a single visual modality into three complementary features: compression-robust rPPG signals, temporal landmark dynamics, and semantic embeddings from pre-trained vision-language models. These features are explicitly aligned through an affinity-driven semantic alignment (ASA) strategy, which models inter-modal relationships through affinity matrices and optimizes their consistency through contrastive learning. Subsequently, our cross-quality similarity learning (CQSL) strategy enhances feature robustness across compression rates. Extensive experiments demonstrate that our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection. Notably, our approach maintains high detection accuracy even when individual features degrade, while providing interpretable insights into feature relationships through explicit alignment.
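To picture the affinity-matrix idea, the sketch below shows one plausible way to align batch-level affinity matrices across modalities with a symmetric KL penalty. It is an illustration under assumptions, not the authors' ASA implementation; the modality feature list and the temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def affinity(z):
    """Row-wise cosine affinity between all samples in the batch: (N, D) -> (N, N)."""
    z = F.normalize(z, dim=1)
    return z @ z.t()

def affinity_alignment_loss(feats, temperature=0.1):
    """Encourage every pair of modalities to agree on how batch samples relate.

    feats: list of (N, D_m) feature tensors, one per derived modality
    (e.g., rPPG signals, landmark dynamics, semantic embeddings).
    Each affinity matrix is row-softmaxed into a distribution over the batch,
    and mismatched distributions are penalized with a symmetric KL term.
    """
    probs = [F.softmax(affinity(z) / temperature, dim=1) for z in feats]
    loss = feats[0].new_zeros(())
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            loss = loss + F.kl_div(probs[i].log(), probs[j], reduction="batchmean")
            loss = loss + F.kl_div(probs[j].log(), probs[i], reduction="batchmean")
    return loss
```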
Citations: 0
Multiple Instance Learning Framework with Masked Hard Instance Mining for Gigapixel Histopathology Image Analysis
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 DOI: 10.1007/s11263-025-02587-0
Wenhao Tang, Sheng Huang, Heng Fang, Fengtao Zhou, Bo Liu, Qingshan Liu
Digitizing pathological images into gigapixel Whole Slide Images (WSIs) has opened new avenues for Computational Pathology (CPath). As positive tissue comprises only a small fraction of gigapixel WSIs, existing Multiple Instance Learning (MIL) methods typically focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting challenging ones. Recent studies have shown that hard examples are crucial for accurately modeling discriminative boundaries. Applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which utilizes a Siamese structure with a consistency constraint to explore the hard instances. Using a class-aware instance probability, MHIM-MIL employs a momentum teacher to mask salient instances and implicitly mine hard instances for training the student model. To obtain diverse, non-redundant hard instances, we adopt large-scale random masking while utilizing a global recycle network to mitigate the risk of losing key features. Furthermore, the student updates the teacher using an exponential moving average, which identifies new hard instances for subsequent training iterations and stabilizes optimization. Experimental results on cancer diagnosis, subtyping, survival analysis tasks, and 12 benchmarks demonstrate that MHIM-MIL outperforms the latest methods in both performance and efficiency. The code is available at: https://github.com/DearCaat/MHIM-MIL.
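The teacher-student mechanics described above, attention-based masking of salient instances plus an exponential-moving-average teacher, can be sketched generically as follows. This is a schematic reconstruction, not the released MHIM-MIL code; the attention accessor, mask ratio, and momentum value are assumptions.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential-moving-average update of the momentum teacher's parameters."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

@torch.no_grad()
def mask_salient_instances(attn_scores, mask_ratio=0.3):
    """Return a keep-mask that hides the most salient (easiest) instances of a bag.

    attn_scores: (N,) teacher attention over the N instances of one WSI bag.
    Masking the top-attended instances forces the student to learn from the
    harder, less obvious ones.
    """
    n_mask = int(mask_ratio * attn_scores.numel())
    keep = torch.ones_like(attn_scores, dtype=torch.bool)
    if n_mask > 0:
        keep[torch.topk(attn_scores, n_mask).indices] = False
    return keep

# One training step, schematically:
# attn = teacher_attention(bag)                        # hypothetical accessor
# logits = student(bag[mask_salient_instances(attn)])
# loss = criterion(logits, label); loss.backward(); optimizer.step()
# ema_update(teacher, student)
```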
Citations: 0
RDG-GS: Relative Depth Guidance with Gaussian Splatting for Real-time Sparse-View 3D Rendering
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 DOI: 10.1007/s11263-025-02594-1
Chenlu Zhan, Yufei Zhang, Yu Lin, Gaoang Wang, Hongwei Wang
Efficiently synthesizing novel views from sparse inputs while maintaining accuracy remains a critical challenge in 3D reconstruction. While advanced techniques like radiance fields and 3D Gaussian Splatting achieve high rendering quality and impressive efficiency with dense view inputs, they suffer from significant geometric reconstruction errors when applied to sparse input views. Moreover, although recent methods leverage monocular depth estimation to enhance geometric learning, their dependence on single-view estimated depth often leads to view inconsistency issues across different viewpoints. Consequently, this reliance on absolute depth can introduce inaccuracies in geometric information, ultimately compromising the quality of scene reconstruction with Gaussian splats. In this paper, we present RDG-GS, a novel sparse-view 3D rendering framework with Relative Depth Guidance based on 3D Gaussian Splatting. The core innovation lies in utilizing relative depth guidance to refine the Gaussian field, steering it towards view-consistent spatial geometric representations, thereby enabling the reconstruction of accurate geometric structures and capturing intricate textures. First, we devise refined depth priors to rectify the coarse estimated depth and insert global and fine-grained scene information into regular Gaussians. Building on this, to address spatial geometric inaccuracies from absolute depth, we propose relative depth guidance by optimizing the similarity between spatially correlated patches of depth and images. Additionally, we directly handle sparse areas that are challenging to converge via adaptive sampling for quick densification. Across extensive experiments on Mip-NeRF360, LLFF, DTU, and Blender, RDG-GS demonstrates state-of-the-art rendering quality and efficiency, marking a significant advancement for real-world applications.
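As one way to picture "relative" rather than absolute depth supervision, the sketch below scores per-patch Pearson correlation between a rendered depth map and a monocular depth prior, which is invariant to the prior's unknown scale and shift. It is an illustrative stand-in, not the paper's actual guidance term; the patch size and the (H, W) input layout are assumptions.

```python
import torch
import torch.nn.functional as F

def patch_correlation_loss(rendered_depth, prior_depth, patch=16, eps=1e-6):
    """Scale/shift-invariant depth consistency via per-patch Pearson correlation.

    rendered_depth, prior_depth: (H, W) tensors. Standardizing each patch before
    comparison sidesteps the unknown absolute scale of a monocular depth prior,
    so only the relative depth structure within a patch is supervised.
    """
    def to_patches(d):
        p = F.unfold(d[None, None], kernel_size=patch, stride=patch)  # (1, patch*patch, L)
        p = p.squeeze(0).t()                                          # (L, patch*patch)
        mean = p.mean(dim=1, keepdim=True)
        std = p.std(dim=1, keepdim=True, unbiased=False)
        return (p - mean) / (std + eps)

    r, g = to_patches(rendered_depth), to_patches(prior_depth)
    corr = (r * g).mean(dim=1)          # per-patch correlation in [-1, 1]
    return (1.0 - corr).mean()
```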
Citations: 0
Liquid: Language Models are Scalable and Unified Multi-Modal Generators
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 DOI: 10.1007/s11263-025-02639-5
Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
We present Liquid, a versatile and native auto-regressive generation paradigm that seamlessly integrates visual comprehension and generation by tokenizing images into discrete codes and learning these code embeddings alongside text tokens within a shared feature space for both vision and language. Unlike previous multimodal large language models (MLLMs), Liquid achieves this integration using any existing large language model (LLM), eliminating the need for external pretrained visual modules such as CLIP and diffusion models. For the first time, Liquid reveals that the power-law scaling laws of unified multimodal models align with those observed in language models, and it discovers that the trade-offs between visual and language tasks diminish as model size increases. Furthermore, the unified token space enables visual generation and comprehension tasks to mutually enhance each other, effectively removing the typical interference seen in earlier models. We demonstrate that existing LLMs can serve as strong foundations for Liquid, saving training costs by 100 times while surpassing Chameleon in multimodal capabilities. Compared to previous unified multimodal models, Liquid maintains language performance on par with mainstream LLMs like Llama2, preserving its potential as a foundational model. Building on this foundation, Liquid outperforms visual generation models like SD v2.1 and SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and text-only tasks. The code and models are available at https://github.com/FoundationVision/Liquid.
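The unified token space can be illustrated with a small helper that shifts discrete image codes past the text vocabulary and concatenates them with text tokens into one autoregressive stream. This is a conceptual sketch, not Liquid's released code; the sentinel ids, the id-shifting scheme, and the VQ tokenizer producing image_codes are assumptions.

```python
import torch

def build_multimodal_sequence(text_ids, image_codes, text_vocab_size, boi_id, eoi_id):
    """Flatten text tokens and discrete image codes into one autoregressive stream.

    text_ids:    (T,) long tensor of ids from the LLM's text tokenizer.
    image_codes: (K,) long tensor of codes from a VQ image tokenizer.
    Image codes are shifted past the text vocabulary so a single embedding table
    and a single next-token softmax cover both modalities; boi_id / eoi_id are
    begin/end-of-image sentinels assumed to be reserved in the merged vocabulary.
    """
    shifted = image_codes + text_vocab_size          # disjoint id range for vision tokens
    boi = text_ids.new_tensor([boi_id])
    eoi = text_ids.new_tensor([eoi_id])
    return torch.cat([text_ids, boi, shifted, eoi])  # trained with plain next-token prediction

# Example: a 32000-word text vocab plus an 8192-code image tokenizer
# seq = build_multimodal_sequence(text_ids, image_codes, 32000, 40192, 40193)
```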
Citations: 0
Concept-Based Explanation for Deep Vision Models: A Comprehensive Survey on Techniques, Taxonomy, Applications, and Recent Advances
IF 19.5 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-01-01 DOI: 10.1007/s11263-025-02647-5
Razan Alharith, Jiashu Zhang, Ashraf Osman Ibrahim, Zhenyu Wu
Concept-based explanation represents an important yet rapidly evolving method aimed at enhancing the interpretability and transparency of deep learning models by clarifying their behaviors and predictions using understandable concepts. However, the current literature lacks a comprehensive survey and classification of the various strategies and methodologies employed to analyze these models. This paper aims to fill this gap by introducing a new taxonomy of concept-based explanation strategies. Following a thorough review of 101 relevant studies, a preliminary taxonomy was developed that groups strategies based on criteria such as data modality, level of supervision, model complexity, explanation scope, and model interpretability. Furthermore, we present a comprehensive evaluation of the advantages and limitations of various methodologies, as well as the datasets commonly used in this field. We also identify promising avenues for further exploration. Our study aims to serve as a useful tool for researchers and professionals interested in advancing concept-based explanation. Furthermore, we have built a GitHub project page that gathers key materials for concept-based explanations, accessible at https://github.com/razanalharith/Concept-Based-Explanation.
Citations: 0