{"title":"脑流:多模态引导下的 fMRI 图像重构","authors":"Jaehoon Joo, Taejin Jeong, Seongjae Hwang","doi":"arxiv-2409.12099","DOIUrl":null,"url":null,"abstract":"Understanding how humans process visual information is one of the crucial\nsteps for unraveling the underlying mechanism of brain activity. Recently, this\ncuriosity has motivated the fMRI-to-image reconstruction task; given the fMRI\ndata from visual stimuli, it aims to reconstruct the corresponding visual\nstimuli. Surprisingly, leveraging powerful generative models such as the Latent\nDiffusion Model (LDM) has shown promising results in reconstructing complex\nvisual stimuli such as high-resolution natural images from vision datasets.\nDespite the impressive structural fidelity of these reconstructions, they often\nlack details of small objects, ambiguous shapes, and semantic nuances.\nConsequently, the incorporation of additional semantic knowledge, beyond mere\nvisuals, becomes imperative. In light of this, we exploit how modern LDMs\neffectively incorporate multi-modal guidance (text guidance, visual guidance,\nand image layout) for structurally and semantically plausible image\ngenerations. Specifically, inspired by the two-streams hypothesis suggesting\nthat perceptual and semantic information are processed in different brain\nregions, our framework, Brain-Streams, maps fMRI signals from these brain\nregions to appropriate embeddings. That is, by extracting textual guidance from\nsemantic information regions and visual guidance from perceptual information\nregions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\nvalidate the reconstruction ability of Brain-Streams both quantitatively and\nqualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\ndata.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance\",\"authors\":\"Jaehoon Joo, Taejin Jeong, Seongjae Hwang\",\"doi\":\"arxiv-2409.12099\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Understanding how humans process visual information is one of the crucial\\nsteps for unraveling the underlying mechanism of brain activity. Recently, this\\ncuriosity has motivated the fMRI-to-image reconstruction task; given the fMRI\\ndata from visual stimuli, it aims to reconstruct the corresponding visual\\nstimuli. Surprisingly, leveraging powerful generative models such as the Latent\\nDiffusion Model (LDM) has shown promising results in reconstructing complex\\nvisual stimuli such as high-resolution natural images from vision datasets.\\nDespite the impressive structural fidelity of these reconstructions, they often\\nlack details of small objects, ambiguous shapes, and semantic nuances.\\nConsequently, the incorporation of additional semantic knowledge, beyond mere\\nvisuals, becomes imperative. In light of this, we exploit how modern LDMs\\neffectively incorporate multi-modal guidance (text guidance, visual guidance,\\nand image layout) for structurally and semantically plausible image\\ngenerations. 
Specifically, inspired by the two-streams hypothesis suggesting\\nthat perceptual and semantic information are processed in different brain\\nregions, our framework, Brain-Streams, maps fMRI signals from these brain\\nregions to appropriate embeddings. That is, by extracting textual guidance from\\nsemantic information regions and visual guidance from perceptual information\\nregions, Brain-Streams provides accurate multi-modal guidance to LDMs. We\\nvalidate the reconstruction ability of Brain-Streams both quantitatively and\\nqualitatively on a real fMRI dataset comprising natural image stimuli and fMRI\\ndata.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"7 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.12099\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.12099","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Brain-Streams: fMRI-to-Image Reconstruction with Multi-modal Guidance
Understanding how humans process visual information is a crucial step toward
unraveling the mechanisms underlying brain activity. This question has recently
motivated the fMRI-to-image reconstruction task: given fMRI data recorded while
a subject views a visual stimulus, the goal is to reconstruct that stimulus.
Remarkably, powerful generative models such as the Latent Diffusion Model (LDM)
have shown promising results in reconstructing complex visual stimuli,
including high-resolution natural images from vision datasets.
Despite the impressive structural fidelity of these reconstructions, they often
miss small objects, render shapes ambiguously, and lose semantic nuances.
Incorporating additional semantic knowledge, beyond visual features alone,
therefore becomes imperative. In light of this, we exploit the ability of
modern LDMs to incorporate multi-modal guidance (text guidance, visual
guidance, and image layout) for structurally and semantically plausible image
generation. Specifically, inspired by the two-streams hypothesis, which posits
that perceptual and semantic information are processed in different brain
regions, our framework, Brain-Streams, maps fMRI signals from these regions to
appropriate embeddings.
That is, by extracting textual guidance from semantic-information regions and
visual guidance from perceptual-information regions, Brain-Streams provides
accurate multi-modal guidance to the LDM. We validate the reconstruction
ability of Brain-Streams both quantitatively and qualitatively on a real fMRI
dataset of natural image stimuli paired with the corresponding fMRI recordings.
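
The abstract describes an architecture rather than code; the sketch below is a
minimal, hypothetical PyTorch illustration of the core idea as stated: separate
regressors map fMRI voxels from semantic-stream and perceptual-stream regions
into text- and image-embedding spaces, and both embeddings then condition a
pretrained LDM. All names, dimensions, and the `ldm_sample` interface are
assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: two-stream fMRI-to-embedding mapping for LDM guidance.
# Voxel counts and embedding dimensions below are placeholder assumptions.
import torch
import torch.nn as nn

class StreamMapper(nn.Module):
    """Maps flattened fMRI voxels from one group of ROIs to a target embedding space."""
    def __init__(self, n_voxels: int, emb_dim: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_voxels, hidden),
            nn.GELU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, voxels: torch.Tensor) -> torch.Tensor:
        return self.net(voxels)

# Semantic-stream ROIs -> text-embedding space (CLIP-style text dim, assumed 768);
# perceptual-stream ROIs -> image-embedding space (assumed 1024).
text_mapper = StreamMapper(n_voxels=8000, emb_dim=768)
image_mapper = StreamMapper(n_voxels=6000, emb_dim=1024)

def reconstruct(semantic_voxels, perceptual_voxels, ldm_sample):
    """`ldm_sample` stands in for an LDM sampler that accepts text and image
    guidance embeddings; its signature here is a hypothetical interface."""
    text_emb = text_mapper(semantic_voxels)       # textual guidance
    image_emb = image_mapper(perceptual_voxels)   # visual guidance
    return ldm_sample(text_guidance=text_emb, image_guidance=image_emb)

# Training sketch: with the LDM frozen, each mapper would be fit to regress
# toward embeddings of the ground-truth stimulus (e.g., an MSE loss against
# the outputs of frozen text/image encoders).
mse = nn.MSELoss()
```

In this line of work it is common to freeze the generative model and train only
the voxel-to-embedding mappers, often per subject; whether Brain-Streams follows
exactly this recipe is not specified in the abstract.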