
Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition: Latest Publications

VIDHALLUC: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/cvpr52734.2025.01281
Chaoyu Li, Eun Woo Im, Pooyan Fazli

Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VIDHALLUC, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VIDHALLUC assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a training-free method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VIDHALLUC, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VIDHALLUC benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.
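To make the reweighting idea concrete, here is a minimal sketch of saliency-based feature reweighting in the spirit of DINO-HEAL: per-patch visual features are scaled by a normalized spatial saliency map before being passed to the language model. The function name, tensor shapes, and the simple max-normalization are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def reweight_visual_features(patch_features: torch.Tensor,
                             saliency: torch.Tensor) -> torch.Tensor:
    """Reweight per-patch visual features by a spatial saliency map.

    patch_features: (N, D) patch embeddings from the MLLM's visual encoder.
    saliency: (N,) non-negative saliency scores (e.g. derived from DINOv2);
              higher values mark regions the model should rely on more.
    This is a hypothetical sketch of the reweighting step, not DINO-HEAL itself.
    """
    # Normalize saliency to a well-behaved weight in [0, 1].
    w = saliency / (saliency.max() + 1e-6)
    # Scale each patch feature by its weight; salient patches keep their
    # magnitude, low-saliency patches are attenuated.
    return patch_features * w.unsqueeze(-1)

# Toy usage with random tensors standing in for real encoder outputs.
feats = torch.randn(256, 1024)   # 256 patches, 1024-dim features
sal = torch.rand(256)            # hypothetical DINOv2-derived saliency scores
reweighted = reweight_visual_features(feats, sal)
print(reweighted.shape)          # torch.Size([256, 1024])
```

Because the reweighting happens purely at inference time, no retraining of the MLLM is needed, which is consistent with the training-free claim in the abstract.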

{"title":"VIDHALLUC: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding.","authors":"Chaoyu Li, Eun Woo Im, Pooyan Fazli","doi":"10.1109/cvpr52734.2025.01281","DOIUrl":"10.1109/cvpr52734.2025.01281","url":null,"abstract":"<p><p>Multimodal large language models (MLLMs) have recently shown significant advancements in video understanding, excelling in content reasoning and instruction-following tasks. However, hallucination, where models generate inaccurate or misleading content, remains underexplored in the video domain. Building on the observation that MLLM visual encoders often fail to distinguish visually different yet semantically similar video pairs, we introduce VIDHALLUC, the largest benchmark designed to examine hallucinations in MLLMs for video understanding. It consists of 5,002 videos, paired to highlight cases prone to hallucinations. VIDHALLUC assesses hallucinations across three critical dimensions: (1) action, (2) temporal sequence, and (3) scene transition. Comprehensive testing shows that most MLLMs are vulnerable to hallucinations across these dimensions. Furthermore, we propose DINO-HEAL, a trainingfree method that reduces hallucinations by incorporating spatial saliency from DINOv2 to reweight visual features during inference. Our results show that DINO-HEAL consistently improves performance on VIDHALLUC, achieving an average improvement of 3.02% in mitigating hallucinations across all tasks. Both the VIDHALLUC benchmark and DINO-HEAL code are available at https://people-robots.github.io/vidhalluc.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"13723-13733"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12408113/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145002144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/cvpr52734.2025.02435
Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, Wenxue Li, Yulong Li, Wenxuan Song, Shiyan Su, Wei Feng, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge

Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
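As a rough illustration of editing the causal mask, the sketch below builds a standard additive causal mask and then opens selected columns so every token can also attend to a few designated "register" positions. This is only a toy stand-in for FarSight's attention-register structure and diminishing masking rate; the chosen positions and the masking scheme are assumptions for illustration.

```python
import torch

def causal_mask_with_registers(seq_len: int, register_positions):
    """Build an additive attention mask: 0 = attend, -inf = blocked.

    Starts from a standard causal (future-blocking) mask, then opens selected
    upper-triangular entries so that every token can attend to a small set of
    'register' positions regardless of order. A simplified stand-in for the
    attention-register idea, not FarSight's actual mask construction.
    """
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask = torch.triu(mask, diagonal=1)      # causal: block strictly-future tokens
    for r in register_positions:
        mask[:, r] = 0.0                     # register columns visible to all tokens
    return mask

# Toy usage: an 8-token sequence with assumed register slots at positions 0 and 4.
mask = causal_mask_with_registers(8, register_positions=[0, 4])
print(mask)
```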

{"title":"Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding.","authors":"Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zile Huang, Haochen Xue, Ziyang Chen, Zelin Peng, Zhiwei Yang, Sijin Zhou, Wenxue Li, Yulong Li, Wenxuan Song, Shiyan Su, Wei Feng, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge","doi":"10.1109/cvpr52734.2025.02435","DOIUrl":"10.1109/cvpr52734.2025.02435","url":null,"abstract":"<p><p>Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"26147-26159"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12425127/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145066712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/cvpr52734.2025.02807
Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei, Angtian Wang, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

In this work, we introduce the Adventurer series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT [46] and Vim [57], Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 3.8× and 6.2× faster than Vim and DeiT to achieve the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images. Code is available at https://github.com/wangf3014/Adventurer.
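The two designs are simple enough to sketch directly. Below is a hypothetical version of the global pooling token and the between-layer flipping operation applied to a batch of patch tokens; the function names, token placement, and shapes are assumptions, not the released implementation.

```python
import torch

def add_global_pool_token(patch_tokens: torch.Tensor) -> torch.Tensor:
    """Prepend a global pooling token (here, the mean of all patch tokens)."""
    pool = patch_tokens.mean(dim=1, keepdim=True)   # (B, 1, D)
    return torch.cat([pool, patch_tokens], dim=1)   # (B, 1+N, D)

def flip_sequence(tokens: torch.Tensor) -> torch.Tensor:
    """Reverse the token order, as applied between every two layers so a
    uni-directional model sees the image in both scan directions."""
    return torch.flip(tokens, dims=[1])

# Toy usage with 14x14 = 196 patch embeddings per image.
x = torch.randn(2, 196, 768)
x = add_global_pool_token(x)
x = flip_sequence(x)            # would sit between pairs of recurrent layers
print(x.shape)                  # torch.Size([2, 197, 768])
```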

{"title":"Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency.","authors":"Feng Wang, Timing Yang, Yaodong Yu, Sucheng Ren, Guoyizhe Wei, Angtian Wang, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie","doi":"10.1109/cvpr52734.2025.02807","DOIUrl":"10.1109/cvpr52734.2025.02807","url":null,"abstract":"<p><p>In this work, we introduce the <b>Adventurer</b> series models where we treat images as sequences of patch tokens and employ uni-directional language models to learn visual representations. This modeling paradigm allows us to process images in a recurrent formulation with linear complexity relative to the sequence length, which can effectively address the memory and computation explosion issues posed by high-resolution and fine-grained images. In detail, we introduce two simple designs that seamlessly integrate image inputs into the causal inference framework: a global pooling token placed at the beginning of the sequence and a flipping operation between every two layers. Extensive empirical studies highlight that compared with the existing plain architectures such as DeiT [46] and Vim [57], Adventurer offers an optimal efficiency-accuracy trade-off. For example, our Adventurer-Base attains a competitive test accuracy of 84.3% on the standard ImageNet-1k benchmark with 216 images/s training throughput, which is 3.8× and 6.2× faster than Vim and DeiT to achieve the same result. As Adventurer offers great computation and memory efficiency and allows scaling with linear complexity, we hope this architecture can benefit future explorations in modeling long sequences for high-resolution or fine-grained images. Code is available at https://github.com/wangf3014/Adventurer.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"30157-30166"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12574601/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145433002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/cvpr52734.2025.01954
Meilong Xu, Saumya Gupta, Xiaoling Hu, Chen Li, Shahira Abousamra, Dimitris Samaras, Prateek Prasanna, Chao Chen

Accurately modeling multi-class cell topology is crucial in digital pathology, as it provides critical insights into tissue structure and pathology. The synthetic generation of cell topology enables realistic simulations of complex tissue environments, enhances downstream tasks by augmenting training data, aligns more closely with pathologists' domain knowledge, and offers new opportunities for controlling and generalizing the tumor microenvironment. In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. Our method refines the simulation of cell distributions and interactions, increasing the precision and interpretability of results in downstream tasks such as cell detection and classification. To assess the topological fidelity of generated layouts, we introduce a new metric, Topological Fréchet Distance (TopoFD), which overcomes the limitations of traditional metrics like FID in evaluating topological structure. Experimental results demonstrate the effectiveness of our approach in generating multi-class cell layouts that capture intricate topological relationships. Code is available at https://github.com/Melon-Xu/TopoCellGen.
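TopoFD is described as a Fréchet-style distance over topological structure. The sketch below shows the generic Fréchet (FID-style) distance between two sets of feature vectors, which is the formula such metrics build on; the actual topological descriptors TopoFD uses are not reproduced here, so the inputs are placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature sets.

    This is the generic FID-style formula; a TopoFD-like metric would apply
    the same idea to topological descriptors of cell layouts rather than
    Inception features (the descriptor computation itself is not shown).
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov1 = np.cov(feats_real, rowvar=False)
    cov2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):     # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

# Toy descriptors: 200 layouts x 16 hypothetical topological features each.
real = np.random.randn(200, 16)
gen = np.random.randn(200, 16) + 0.1
print(frechet_distance(real, gen))
```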

{"title":"<i>TopoCellGen</i>: Generating Histopathology Cell Topology with a Diffusion Model.","authors":"Meilong Xu, Saumya Gupta, Xiaoling Hu, Chen Li, Shahira Abousamra, Dimitris Samaras, Prateek Prasanna, Chao Chen","doi":"10.1109/cvpr52734.2025.01954","DOIUrl":"https://doi.org/10.1109/cvpr52734.2025.01954","url":null,"abstract":"<p><p>Accurately modeling multi-class cell topology is crucial in digital pathology, as it provides critical insights into tissue structure and pathology. The synthetic generation of cell topology enables realistic simulations of complex tissue environments, enhances downstream tasks by augmenting training data, aligns more closely with pathologists' domain knowledge, and offers new opportunities for controlling and generalizing the tumor microenvironment. In this paper, we propose a novel approach that integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. Our method refines the simulation of cell distributions and interactions, increasing the precision and interpretability of results in downstream tasks such as cell detection and classification. To assess the topological fidelity of generated layouts, we introduce a new metric, Topological Fréchet Distance (TopoFD), which overcomes the limitations of traditional metrics like FID in evaluating topological structure. Experimental results demonstrate the effectiveness of our approach in generating multi-class cell layouts that capture intricate topological relationships. Code is available at https://github.com/Melon-Xu/TopoCellGen.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"20979-20989"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12380007/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/CVPR52734.2025.00794
Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu

The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multi-modal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. Our benchmark is publicly available at https://yunlong10.github.io/VidComposition/.
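Since the benchmark is multiple-choice, scoring a model reduces to matching predicted option letters against the annotated answers. A minimal evaluation loop of this kind might look like the sketch below; the naive letter-extraction heuristic is an assumption about the models' output format, not the benchmark's official scorer.

```python
import re

def extract_choice(model_output: str) -> str:
    """Pull the first standalone option letter (A-D) out of a free-form answer."""
    match = re.search(r"\b([A-D])\b", model_output.upper())
    return match.group(1) if match else ""

def multiple_choice_accuracy(model_outputs, answers) -> float:
    """Fraction of questions where the extracted letter matches the answer key."""
    correct = sum(extract_choice(o) == a.upper()
                  for o, a in zip(model_outputs, answers))
    return correct / len(answers)

# Toy usage with made-up model answers and keys.
outputs = ["The answer is B.", "C", "I think (A) camera pan."]
keys = ["B", "C", "D"]
print(multiple_choice_accuracy(outputs, keys))  # 0.666...
```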

{"title":"VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?","authors":"Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu","doi":"10.1109/CVPR52734.2025.00794","DOIUrl":"10.1109/CVPR52734.2025.00794","url":null,"abstract":"<p><p>The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multi-modal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. Our benchmark is publicly available at https://yunlong10.github.io/VidComposition/.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"8490-8500"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413207/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145016882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/cvpr52734.2025.00978
Peirong Liu, Ana Lawry Aguila, Juan E Iglesias

Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA's effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images in the presence of pathology. Code is available at https://github.com/peirong26/UNA.
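The core training trick is corrupting healthy scans with synthetic pathology on the fly. The toy function below injects a single smooth Gaussian blob into a 2D image as a crude stand-in for an anomaly; UNA's fluid-driven randomization is far richer, so every detail here is an assumption used only to illustrate the on-the-fly corruption idea.

```python
import numpy as np

def add_random_anomaly(image: np.ndarray, intensity: float = 0.8,
                       sigma: float = 8.0) -> np.ndarray:
    """Insert one random smooth blob into a 2D image as a stand-in anomaly.

    Not UNA's fluid-driven method; only a simplified illustration of
    synthesizing pathology-like perturbations on healthy data so a model can
    be trained to recover the underlying normal anatomy.
    """
    h, w = image.shape
    cy, cx = np.random.randint(0, h), np.random.randint(0, w)
    yy, xx = np.mgrid[0:h, 0:w]
    blob = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return image + intensity * blob

# Toy usage on a blank "healthy" image.
healthy = np.zeros((128, 128))
corrupted = add_random_anomaly(healthy)
print(corrupted.max())   # close to `intensity` at the blob centre
```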

{"title":"Unraveling Normal Anatomy via Fluid-Driven Anomaly Randomization.","authors":"Peirong Liu, Ana Lawry Aguila, Juan E Iglesias","doi":"10.1109/cvpr52734.2025.00978","DOIUrl":"https://doi.org/10.1109/cvpr52734.2025.00978","url":null,"abstract":"<p><p>Data-driven machine learning has made significant strides in medical image analysis. However, most existing methods are tailored to specific modalities and assume a particular resolution (often isotropic). This limits their generalizability in clinical settings, where variations in scan appearance arise from differences in sequence parameters, resolution, and orientation. Furthermore, most general-purpose models are designed for healthy subjects and suffer from performance degradation when pathology is present. We introduce UNA (Unraveling Normal Anatomy), the first modality-agnostic learning approach for normal brain anatomy reconstruction that can handle both healthy scans and cases with pathology. We propose a fluid-driven anomaly randomization method that generates an unlimited number of realistic pathology profiles on-the-fly. UNA is trained on a combination of synthetic and real data, and can be applied directly to real images with potential pathology without the need for fine-tuning. We demonstrate UNA's effectiveness in reconstructing healthy brain anatomy and showcase its direct application to anomaly detection, using both simulated and real images from 3D healthy and stroke datasets, including CT and MRI scans. By bridging the gap between healthy and diseased images, UNA enables the use of general-purpose models on diseased images, opening up new opportunities for large-scale analysis of uncurated clinical images in the presence of pathology. Code is available at https://github.com/peirong26/UNA.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"10455-10465"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12376902/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/cvpr52734.2025.01455
Aniruddha Ganguly, Debolina Chatterjee, Wentao Huang, Jie Zhang, Alisa Yurovsky, Travis Steele Johnson, Chao Chen

Recent advances in Spatial Transcriptomics (ST) pair histology images with spatially resolved gene expression profiles, enabling predictions of gene expression across different tissue locations based on image patches. This opens up new possibilities for enhancing whole slide image (WSI) prediction tasks with localized gene expression. However, existing methods fail to fully leverage the interactions between different tissue locations, which are crucial for accurate joint prediction. To address this, we introduce MERGE (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNN) to improve gene expression predictions from WSIs. By clustering tissue image patches based on both spatial and morphological features, and incorporating intra- and inter-cluster edges, our approach fosters interactions between distant tissue locations during GNN learning. As an additional contribution, we evaluate different data smoothing techniques that are necessary to mitigate artifacts in ST data, often caused by technical imperfections. We advocate for adopting gene-aware smoothing methods that are more biologically justified. Experimental results on gene expression prediction show that our GNN method outperforms state-of-the-art techniques across multiple metrics.
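A toy version of the multi-faceted graph construction might cluster patches on concatenated spatial and morphological features, then add intra-cluster nearest-neighbour edges and a few inter-cluster edges. The clustering choice (k-means), neighbourhood size, and the centroid-representative trick below are all assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_patch_graph(coords, morph_feats, n_clusters=4, k_spatial=3):
    """Toy multi-faceted graph over WSI patches.

    Clusters patches on spatial + morphological features, connects each patch
    to its k nearest spatial neighbours inside the same cluster (intra-cluster
    edges), and links one representative patch per cluster to every other
    cluster's representative (inter-cluster edges). Returns (labels, edges).
    """
    feats = np.concatenate([coords, morph_feats], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)

    edges = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        for i in idx:
            d = np.linalg.norm(coords[idx] - coords[i], axis=1)
            for j in idx[np.argsort(d)[1:k_spatial + 1]]:
                edges.append((int(i), int(j)))
    reps = [int(np.where(labels == c)[0][0]) for c in range(n_clusters)]
    edges += [(a, b) for a in reps for b in reps if a != b]
    return labels, edges

# Toy usage: 64 patches with 2D centres and 8 hypothetical morphology features.
coords = np.random.rand(64, 2) * 1000
morph = np.random.rand(64, 8)
labels, edges = build_patch_graph(coords, morph)
print(len(edges))
```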

{"title":"MERGE: Multi-faceted Hierarchical Graph-based GNN for Gene Expression Prediction from Whole Slide Histopathology Images.","authors":"Aniruddha Ganguly, Debolina Chatterjee, Wentao Huang, Jie Zhang, Alisa Yurovsky, Travis Steele Johnson, Chao Chen","doi":"10.1109/cvpr52734.2025.01455","DOIUrl":"https://doi.org/10.1109/cvpr52734.2025.01455","url":null,"abstract":"<p><p>Recent advances in Spatial Transcriptomics (ST) pair histology images with spatially resolved gene expression profiles, enabling predictions of gene expression across different tissue locations based on image patches. This opens up new possibilities for enhancing whole slide image (WSI) prediction tasks with localized gene expression. However, existing methods fail to fully leverage the interactions between different tissue locations, which are crucial for accurate joint prediction. To address this, we introduce <b>MERGE</b> (Multi-faceted hiErarchical gRaph for Gene Expressions), which combines a multi-faceted hierarchical graph construction strategy with graph neural networks (GNN) to improve gene expression predictions from WSIs. By clustering tissue image patches based on both spatial and morphological features, and incorporating intra- and inter-cluster edges, our approach fosters interactions between distant tissue locations during GNN learning. As an additional contribution, we evaluate different data smoothing techniques that are necessary to mitigate artifacts in ST data, often caused by technical imperfections. We advocate for adopting gene-aware smoothing methods that are more biologically justified. Experimental results on gene expression prediction show that our GNN method outperforms state-of-the-art techniques across multiple metrics.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"15611-15620"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12380040/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144981927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Mamba-Reg: Vision Mamba Also Needs Registers.
Pub Date: 2025-06-01 Epub Date: 2025-08-13 DOI: 10.1109/CVPR52734.2025.01392
Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba: they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba®. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba®'s feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, Mamba® attains stronger performance and scales better. For example, on the ImageNet benchmark, our Mamba®-B attains 83.0% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size with 341M parameters, attaining competitive accuracies of 83.6% and 84.5% for 224×224 and 384×384 inputs, respectively. Additional validation on the downstream semantic segmentation task also supports Mamba®'s efficacy. Code is available at https://github.com/wangf3014/Mamba-Reg.
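A minimal sketch of the two modifications, under assumed shapes and helper names: register embeddings are interleaved evenly into the patch-token sequence, and their positions are remembered so they can later be gathered and pooled ("recycled") for the prediction head. None of this reproduces the released implementation.

```python
import torch

def insert_registers(patch_tokens: torch.Tensor, registers: torch.Tensor):
    """Evenly interleave register tokens into a patch-token sequence.

    patch_tokens: (B, N, D); registers: (R, D) learnable embeddings.
    Returns the mixed sequence (B, N+R, D) and the register positions, so the
    registers can later be gathered and pooled for the classification head.
    Hypothetical helper, not the paper's code.
    """
    B, N, D = patch_tokens.shape
    R = registers.shape[0]
    step = N // R
    pieces, positions, cursor = [], [], 0
    for i in range(R):
        end = (i + 1) * step if i < R - 1 else N
        chunk = patch_tokens[:, i * step:end]
        pieces.append(registers[i].expand(B, 1, D))   # one register per segment
        positions.append(cursor)
        cursor += 1 + chunk.shape[1]
        pieces.append(chunk)
    return torch.cat(pieces, dim=1), positions

# Toy usage: 196 patch tokens, 4 registers.
x = torch.randn(2, 196, 768)
regs = torch.nn.Parameter(torch.zeros(4, 768))
seq, pos = insert_registers(x, regs)
pooled = seq[:, pos].mean(dim=1)   # recycled registers -> (B, 768) for the head
print(seq.shape, pos, pooled.shape)
```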

{"title":"Mamba-Reg: Vision Mamba Also Needs Registers.","authors":"Feng Wang, Jiahao Wang, Sucheng Ren, Guoyizhe Wei, Jieru Mei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie","doi":"10.1109/CVPR52734.2025.01392","DOIUrl":"10.1109/CVPR52734.2025.01392","url":null,"abstract":"<p><p>Similar to Vision Transformers, this paper identifies artifacts also present within the feature maps of Vision Mamba. These artifacts, corresponding to high-norm tokens emerging in low-information background areas of images, appear much more severe in Vision Mamba-they exist prevalently even with the tiny-sized model and activate extensively across background regions. To mitigate this issue, we follow the prior solution of introducing register tokens into Vision Mamba. To better cope with Mamba blocks' uni-directional inference paradigm, two key modifications are introduced: 1) evenly inserting registers throughout the input token sequence, and 2) recycling registers for final decision predictions. We term this new architecture Mamba<sup>®</sup>. Qualitative observations suggest, compared to vanilla Vision Mamba, Mamba<sup>®</sup>'s feature maps appear cleaner and more focused on semantically meaningful regions. Quantitatively, Mamba<sup>®</sup>attains stronger performance and scales better. For example, on the ImageNet benchmark, our Mamba<sup>®</sup>-B attains 83.0% accuracy, significantly outperforming Vim-B's 81.8%; furthermore, we provide the first successful scaling to the large model size with 341M parameters, attaining competitive accuracies of 83.6% and 84.5% for 224×224 and 384×384 inputs, respectively. Additional validation on the downstream semantic segmentation task also supports Mamba<sup>®</sup>'s efficacy. Code is available at https://github.com/wangf3014/Mamba-Reg.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2025 ","pages":"14944-14953"},"PeriodicalIF":0.0,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12700654/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145758626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering.
Pub Date: 2024-06-01 Epub Date: 2024-09-16 DOI: 10.1109/cvpr52733.2024.01108
Vivek Gopalakrishnan, Neel Dey, Polina Golland

Surgical decisions are informed by aligning rapid portable 2D intraoperative images (e.g. X-rays) to a high-fidelity 3D preoperative reference scan (e.g. CT). However, 2D/3D registration can often fail in practice: conventional optimization methods are prohibitively slow and susceptible to local minima, while neural networks trained on small datasets fail on new patients or require impractical landmark supervision. We present DiffPose, a self-supervised approach that leverages patient-specific simulation and differentiable physics-based rendering to achieve accurate 2D/3D registration without relying on manually labeled data. Preoperatively, a CNN is trained to regress the pose of a randomly oriented synthetic X-ray rendered from the preoperative CT. The CNN then initializes rapid intraoperative test-time optimization that uses the differentiable X-ray renderer to refine the solution. Our work further proposes several geometrically principled methods for sampling camera poses from SE(3), for sparse differentiable rendering, and for driving registration in the tangent space se(3) with geodesic and multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy across surgical datasets at intraoperative speeds, improving upon existing unsupervised methods by an order of magnitude and even outperforming supervised baselines. Our implementation is at https://github.com/eigenvivek/DiffPose.
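For intuition on the geodesic part of the loss, the snippet below computes the standard SO(3) geodesic angle between two rotation matrices plus a translation term. The weighting, the multiscale locality-sensitive component, and the exact se(3) parameterization used in DiffPose are not reproduced, so this is only an assumed simplification.

```python
import numpy as np

def se3_geodesic_loss(R_pred, t_pred, R_gt, t_gt, w_rot=1.0, w_trans=1.0):
    """Toy pose loss: rotation geodesic angle plus translation error.

    R_*: (3, 3) rotation matrices; t_*: (3,) translations. DiffPose works in
    the tangent space se(3) with multiscale losses; this is only the common
    rotation-geodesic + translation form, for illustration.
    """
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    angle = np.arccos(np.clip(cos, -1.0, 1.0))   # geodesic distance on SO(3)
    return w_rot * angle + w_trans * np.linalg.norm(t_pred - t_gt)

# Toy usage: identical rotations, half-voxel translation offset.
R = np.eye(3)
t = np.zeros(3)
print(se3_geodesic_loss(R, t, R, t + 0.5))   # 0.8660... (pure translation error)
```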

{"title":"Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering.","authors":"Vivek Gopalakrishnan, Neel Dey, Polina Golland","doi":"10.1109/cvpr52733.2024.01108","DOIUrl":"10.1109/cvpr52733.2024.01108","url":null,"abstract":"<p><p>Surgical decisions are informed by aligning rapid portable 2D intraoperative images (e.g. X-rays) to a high-fidelity 3D preoperative reference scan (e.g. CT). However, 2D/3D registration can often fail in practice: conventional optimization methods are prohibitively slow and susceptible to local minima, while neural networks trained on small datasets fail on new patients or require impractical landmark supervision. We present DiffPose, a self-supervised approach that leverages patient-specific simulation and differentiable physics-based rendering to achieve accurate 2D/3D registration without relying on manually labeled data. Preoperatively, a CNN is trained to regress the pose of a randomly oriented synthetic X-ray rendered from the preoperative CT. The CNN then initializes rapid intraoperative test-time optimization that uses the differentiable X-ray renderer to refine the solution. Our work further proposes several geometrically principled methods for sampling camera poses from <math><mrow><mi>SE</mi> <mo>(</mo> <mn>3</mn> <mo>)</mo></mrow> </math> , for sparse differentiable rendering, and for driving registration in the tangent space <math> <mrow><mstyle><mi>se</mi></mstyle> <mo>(</mo> <mn>3</mn> <mo>)</mo></mrow> </math> with geodesic and multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy across surgical datasets at intraoperative speeds, improving upon existing unsupervised methods by an order of magnitude and even outperforming supervised baselines. Our implementation is at https://github.com/eigenvivek/DiffPose.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2024 ","pages":"11662-11672"},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12505627/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145260010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Learned representation-guided diffusion models for large-image generation.
Pub Date: 2024-06-01 Epub Date: 2024-09-16 DOI: 10.1109/cvpr52733.2024.00815
Alexandros Graikos, Srikar Yellapragada, Minh-Quan Le, Saarthak Kapse, Prateek Prasanna, Joel Saltz, Dimitris Samaras

To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions.
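Conditioning a diffusion model on an SSL embedding usually means injecting the embedding alongside the timestep signal inside the denoiser. The toy module below shows that wiring with a small MLP denoiser; the dimensions, projection layers, and additive fusion are assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class EmbeddingConditionedDenoiser(nn.Module):
    """Minimal denoiser that conditions on an SSL embedding instead of a label.

    Toy MLP over flattened inputs; real models use a UNet or transformer. The
    SSL embedding (e.g. from a self-supervised encoder) is projected and added
    to the timestep embedding, a common conditioning pathway.
    """
    def __init__(self, data_dim=256, cond_dim=384, hidden=512):
        super().__init__()
        self.time_proj = nn.Linear(1, hidden)
        self.cond_proj = nn.Linear(cond_dim, hidden)
        self.net = nn.Sequential(nn.Linear(data_dim + hidden, hidden),
                                 nn.SiLU(),
                                 nn.Linear(hidden, data_dim))

    def forward(self, x_t, t, ssl_embedding):
        cond = self.time_proj(t[:, None].float()) + self.cond_proj(ssl_embedding)
        return self.net(torch.cat([x_t, cond], dim=-1))   # predicts the noise

# Toy usage: batch of 8 noisy samples, random timesteps, placeholder embeddings.
model = EmbeddingConditionedDenoiser()
x_t = torch.randn(8, 256)
t = torch.randint(0, 1000, (8,))
z = torch.randn(8, 384)            # hypothetical SSL embeddings
print(model(x_t, t, z).shape)      # torch.Size([8, 256])
```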

{"title":"Learned representation-guided diffusion models for large-image generation.","authors":"Alexandros Graikos, Srikar Yellapragada, Minh-Quan Le, Saarthak Kapse, Prateek Prasanna, Joel Saltz, Dimitris Samaras","doi":"10.1109/cvpr52733.2024.00815","DOIUrl":"10.1109/cvpr52733.2024.00815","url":null,"abstract":"<p><p>To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions.</p>","PeriodicalId":74560,"journal":{"name":"Proceedings. IEEE Computer Society Conference on Computer Vision and Pattern Recognition","volume":"2024 ","pages":"8532-8542"},"PeriodicalIF":0.0,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11601131/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142741465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0