The development of diffusion models has significantly advanced research on image stylization, particularly stylizing a content image according to a given style image, a task that has attracted many researchers. The main challenge in this reference-based stylization task is preserving the details of the content image while incorporating the color and texture of the style image. The challenge becomes even more pronounced when the content image is a portrait with complex textural details. To address it, we propose MagicStyle, a diffusion-based reference image stylization method designed specifically for portraits. MagicStyle consists of two phases: Content and Style DDIM Inversion (CSDI) and Feature Fusion Forward (FFF). The CSDI phase performs DDIM Inversion separately on the content image and the style image, storing the self-attention query, key, and value features of both images during inversion. The FFF phase then runs forward denoising, harmoniously integrating the texture and color information from the pre-stored queries, keys, and values into the diffusion generation process through our well-designed Feature Fusion Attention (FFA). Comprehensive comparison and ablation experiments validate the effectiveness of the proposed MagicStyle and FFA.
{"title":"MagicStyle: Portrait Stylization Based on Reference Image","authors":"Zhaoli Deng, Kaibin Zhou, Fanyi Wang, Zhenpeng Mi","doi":"arxiv-2409.08156","DOIUrl":"https://doi.org/arxiv-2409.08156","url":null,"abstract":"The development of diffusion models has significantly advanced the research\u0000on image stylization, particularly in the area of stylizing a content image\u0000based on a given style image, which has attracted many scholars. The main\u0000challenge in this reference image stylization task lies in how to maintain the\u0000details of the content image while incorporating the color and texture features\u0000of the style image. This challenge becomes even more pronounced when the\u0000content image is a portrait which has complex textural details. To address this\u0000challenge, we propose a diffusion model-based reference image stylization\u0000method specifically for portraits, called MagicStyle. MagicStyle consists of\u0000two phases: Content and Style DDIM Inversion (CSDI) and Feature Fusion Forward\u0000(FFF). The CSDI phase involves a reverse denoising process, where DDIM\u0000Inversion is performed separately on the content image and the style image,\u0000storing the self-attention query, key and value features of both images during\u0000the inversion process. The FFF phase executes forward denoising, harmoniously\u0000integrating the texture and color information from the pre-stored feature\u0000queries, keys and values into the diffusion generation process based on our\u0000Well-designed Feature Fusion Attention (FFA). We conducted comprehensive\u0000comparative and ablation experiments to validate the effectiveness of our\u0000proposed MagicStyle and FFA.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, text-to-image generative models have been misused to create unauthorized malicious images of individuals, posing a growing social problem. Previous solutions, such as Anti-DreamBooth, add adversarial noise to images to protect them from being used as training data for malicious generation. However, we found that this adversarial noise can be removed by adversarial purification methods such as DiffPure. We therefore propose a new adversarial attack method that adds strong perturbations to the high-frequency regions of images, making the protection more robust to adversarial purification. Our experiments show that the protected images retain the adversarial noise even after purification, hindering malicious image generation.
{"title":"High-Frequency Anti-DreamBooth: Robust Defense Against Image Synthesis","authors":"Takuto Onikubo, Yusuke Matsui","doi":"arxiv-2409.08167","DOIUrl":"https://doi.org/arxiv-2409.08167","url":null,"abstract":"Recently, text-to-image generative models have been misused to create\u0000unauthorized malicious images of individuals, posing a growing social problem.\u0000Previous solutions, such as Anti-DreamBooth, add adversarial noise to images to\u0000protect them from being used as training data for malicious generation.\u0000However, we found that the adversarial noise can be removed by adversarial\u0000purification methods such as DiffPure. Therefore, we propose a new adversarial\u0000attack method that adds strong perturbation on the high-frequency areas of\u0000images to make it more robust to adversarial purification. Our experiment\u0000showed that the adversarial images retained noise even after adversarial\u0000purification, hindering malicious image generation.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"7 12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position multiple instances and control their generated features. The Layout-to-Image (L2I) task was introduced to address the positioning challenge by incorporating bounding boxes as spatial control signals, but it still falls short of generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that the IFAdapter outperforms other models in both quantitative and qualitative evaluations.
{"title":"IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation","authors":"Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang","doi":"arxiv-2409.08240","DOIUrl":"https://doi.org/arxiv-2409.08240","url":null,"abstract":"While Text-to-Image (T2I) diffusion models excel at generating visually\u0000appealing images of individual instances, they struggle to accurately position\u0000and control the features generation of multiple instances. The Layout-to-Image\u0000(L2I) task was introduced to address the positioning challenges by\u0000incorporating bounding boxes as spatial control signals, but it still falls\u0000short in generating precise instance features. In response, we propose the\u0000Instance Feature Generation (IFG) task, which aims to ensure both positional\u0000accuracy and feature fidelity in generated instances. To address the IFG task,\u0000we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances\u0000feature depiction by incorporating additional appearance tokens and utilizing\u0000an Instance Semantic Map to align instance-level features with spatial\u0000locations. The IFAdapter guides the diffusion process as a plug-and-play\u0000module, making it adaptable to various community models. For evaluation, we\u0000contribute an IFG benchmark and develop a verification pipeline to objectively\u0000compare models' abilities to generate instances with accurate positioning and\u0000features. Experimental results demonstrate that IFAdapter outperforms other\u0000models in both quantitative and qualitative evaluations.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"61 13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the wake of a fabricated image of an explosion at the Pentagon, the ability to discern real images from fake ones has never been more critical. Our study introduces a novel multi-modal approach to detecting AI-generated images amid the proliferation of new generation methods such as diffusion models. Our method, UGAD, comprises three key detection steps: first, we transform RGB images into the YCbCr color space and apply an Integral Radial Operation to emphasize salient radial features. Second, a Spatial Fourier Extraction operation performs a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, a classification stage processes the resulting features through dense layers with a softmax output. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and a 28.43% increase in AUC compared to existing state-of-the-art methods.
{"title":"UGAD: Universal Generative AI Detector utilizing Frequency Fingerprints","authors":"Inzamamul Alam, Muhammad Shahid Muneer, Simon S. Woo","doi":"arxiv-2409.07913","DOIUrl":"https://doi.org/arxiv-2409.07913","url":null,"abstract":"In the wake of a fabricated explosion image at the Pentagon, an ability to\u0000discern real images from fake counterparts has never been more critical. Our\u0000study introduces a novel multi-modal approach to detect AI-generated images\u0000amidst the proliferation of new generation methods such as Diffusion models.\u0000Our method, UGAD, encompasses three key detection steps: First, we transform\u0000the RGB images into YCbCr channels and apply an Integral Radial Operation to\u0000emphasize salient radial features. Secondly, the Spatial Fourier Extraction\u0000operation is used for a spatial shift, utilizing a pre-trained deep learning\u0000network for optimal feature extraction. Finally, the deep neural network\u0000classification stage processes the data through dense layers using softmax for\u0000classification. Our approach significantly enhances the accuracy of\u0000differentiating between real and AI-generated images, as evidenced by a 12.64%\u0000increase in accuracy and 28.43% increase in AUC compared to existing\u0000state-of-the-art methods.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Both manual markers (relating to the use of the hands) and non-manual markers (NMM), such as facial expressions or mouthing cues, are important for conveying the complete meaning of phrases in American Sign Language (ASL). Efforts have been made to advance sign language to spoken/written language understanding, but most have focused primarily on manual features. In this work, using advanced neural machine translation methods, we examine and report the extent to which facial expressions contribute to understanding sign language phrases. We present a sign language translation architecture consisting of two-stream encoders, with one encoder handling the face and the other the upper body (with hands). We propose a new parallel cross-attention decoding mechanism that is useful for quantifying the influence of each input modality on the output. The two streams from the encoder are directed simultaneously to different attention stacks in the decoder. Examining the properties of the parallel cross-attention weights allows us to analyze the importance of facial markers compared to body and hand features during translation.
{"title":"Cross-Attention Based Influence Model for Manual and Nonmanual Sign Language Analysis","authors":"Lipisha Chaudhary, Fei Xu, Ifeoma Nwogu","doi":"arxiv-2409.08162","DOIUrl":"https://doi.org/arxiv-2409.08162","url":null,"abstract":"Both manual (relating to the use of hands) and non-manual markers (NMM), such\u0000as facial expressions or mouthing cues, are important for providing the\u0000complete meaning of phrases in American Sign Language (ASL). Efforts have been\u0000made in advancing sign language to spoken/written language understanding, but\u0000most of these have primarily focused on manual features only. In this work,\u0000using advanced neural machine translation methods, we examine and report on the\u0000extent to which facial expressions contribute to understanding sign language\u0000phrases. We present a sign language translation architecture consisting of\u0000two-stream encoders, with one encoder handling the face and the other handling\u0000the upper body (with hands). We propose a new parallel cross-attention decoding\u0000mechanism that is useful for quantifying the influence of each input modality\u0000on the output. The two streams from the encoder are directed simultaneously to\u0000different attention stacks in the decoder. Examining the properties of the\u0000parallel cross-attention weights allows us to analyze the importance of facial\u0000markers compared to body and hand features during a translating task.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering artworks by style has many potential real-world applications, such as art recommendation, style-based search and retrieval, and the study of how artistic style evolves within an artwork corpus. However, style-based clustering of artworks remains a largely unaddressed problem. The few existing methods for clustering artworks rely principally on generic image feature representations derived from deep neural networks and do not deal specifically with artistic style. In this paper, we introduce and examine the notion of style-based clustering of visual artworks. Our main objective is to explore neural feature representations and architectures that can be used for style-based clustering and to observe their impact and effectiveness. We develop several methods and assess their relative efficacy through qualitative and quantitative analysis, applying them to four artwork corpora and four curated, synthetically styled datasets. Our analysis provides key insights on the architectures, feature representations, and evaluation methods suitable for style-based clustering.
{"title":"Style Based Clustering of Visual Artworks","authors":"Abhishek Dangeti, Pavan Gajula, Vivek Srivastava, Vikram Jamwal","doi":"arxiv-2409.08245","DOIUrl":"https://doi.org/arxiv-2409.08245","url":null,"abstract":"Clustering artworks based on style has many potential real-world applications\u0000like art recommendations, style-based search and retrieval, and the study of\u0000artistic style evolution in an artwork corpus. However, clustering artworks\u0000based on style is largely an unaddressed problem. A few present methods for\u0000clustering artworks principally rely on generic image feature representations\u0000derived from deep neural networks and do not specifically deal with the\u0000artistic style. In this paper, we introduce and deliberate over the notion of\u0000style-based clustering of visual artworks. Our main objective is to explore\u0000neural feature representations and architectures that can be used for\u0000style-based clustering and observe their impact and effectiveness. We develop\u0000different methods and assess their relative efficacy for style-based clustering\u0000through qualitative and quantitative analysis by applying them to four artwork\u0000corpora and four curated synthetically styled datasets. Our analysis provides\u0000some key novel insights on architectures, feature representations, and\u0000evaluation methods suitable for style-based clustering.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"72 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose a new method for generating realistic datasets with distribution shifts using any decoder-based generative model. Our approach systematically creates datasets with varying intensities of distribution shift, facilitating a comprehensive analysis of model performance degradation. We then use these generated datasets to evaluate the performance of various commonly used networks and observe a consistent decline in performance with increasing shift intensity, even when the effect is almost imperceptible to the human eye. We see this degradation even when using data augmentations. We also find that enlarging the training dataset beyond a certain point has no effect on robustness, and that stronger inductive biases increase robustness.
{"title":"Control+Shift: Generating Controllable Distribution Shifts","authors":"Roy Friedman, Rhea Chowers","doi":"arxiv-2409.07940","DOIUrl":"https://doi.org/arxiv-2409.07940","url":null,"abstract":"We propose a new method for generating realistic datasets with distribution\u0000shifts using any decoder-based generative model. Our approach systematically\u0000creates datasets with varying intensities of distribution shifts, facilitating\u0000a comprehensive analysis of model performance degradation. We then use these\u0000generated datasets to evaluate the performance of various commonly used\u0000networks and observe a consistent decline in performance with increasing shift\u0000intensity, even when the effect is almost perceptually unnoticeable to the\u0000human eye. We see this degradation even when using data augmentations. We also\u0000find that enlarging the training dataset beyond a certain point has no effect\u0000on the robustness and that stronger inductive biases increase robustness.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"155 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advancements in text-to-image diffusion models have demonstrated remarkable success, yet they often struggle to fully capture the user's intent. Existing approaches using textual inputs combined with bounding boxes or region masks fall short in providing precise spatial guidance, often leading to misaligned or unintended object orientation. To address these limitations, we propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that utilizes simple user-provided scribbles as visual prompts to guide image generation. However, incorporating scribbles into diffusion models presents challenges due to their sparse and thin nature, making it difficult to ensure accurate orientation alignment. To overcome these challenges, we introduce moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs. Experimental results on the PASCAL-Scribble dataset demonstrate significant improvements in spatial control and consistency, showcasing the effectiveness of scribble-based guidance in diffusion models. Our code is available at https://github.com/kaist-cvml-lab/scribble-diffusion.
{"title":"Scribble-Guided Diffusion for Training-free Text-to-Image Generation","authors":"Seonho Lee, Jiho Choi, Seohyun Lim, Jiwook Kim, Hyunjung Shim","doi":"arxiv-2409.08026","DOIUrl":"https://doi.org/arxiv-2409.08026","url":null,"abstract":"Recent advancements in text-to-image diffusion models have demonstrated\u0000remarkable success, yet they often struggle to fully capture the user's intent.\u0000Existing approaches using textual inputs combined with bounding boxes or region\u0000masks fall short in providing precise spatial guidance, often leading to\u0000misaligned or unintended object orientation. To address these limitations, we\u0000propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that\u0000utilizes simple user-provided scribbles as visual prompts to guide image\u0000generation. However, incorporating scribbles into diffusion models presents\u0000challenges due to their sparse and thin nature, making it difficult to ensure\u0000accurate orientation alignment. To overcome these challenges, we introduce\u0000moment alignment and scribble propagation, which allow for more effective and\u0000flexible alignment between generated images and scribble inputs. Experimental\u0000results on the PASCAL-Scribble dataset demonstrate significant improvements in\u0000spatial control and consistency, showcasing the effectiveness of scribble-based\u0000guidance in diffusion models. Our code is available at\u0000https://github.com/kaist-cvml-lab/scribble-diffusion.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, methods like Zero-1-2-3 have focused on single-view 3D reconstruction and achieved remarkable success. However, their predictions for unseen areas rely heavily on the inductive bias of large-scale pretrained diffusion models. Although subsequent work, such as DreamComposer, attempts to make predictions more controllable by incorporating additional views, the results remain unrealistic due to feature entanglement in the vanilla latent space, including factors such as lighting, material, and structure. To address these issues, we introduce the Visual Isotropy 3D Reconstruction Model (VI3DRM), a diffusion-based sparse-view 3D reconstruction model that operates within an ID-consistent and perspective-disentangled 3D latent space. By facilitating the disentanglement of semantic information, color, material properties, and lighting, VI3DRM is capable of generating highly realistic images that are indistinguishable from real photographs. By leveraging both real and synthesized images, our approach enables the accurate construction of pointmaps, ultimately producing finely textured meshes or point clouds. On the novel view synthesis (NVS) task, evaluated on the GSO dataset, VI3DRM significantly outperforms the state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of 0.929, and an LPIPS of 0.027. Code will be made available upon publication.
{"title":"VI3DRM:Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis","authors":"Hao Chen, Jiafu Wu, Ying Jin, Jinlong Peng, Xiaofeng Mao, Mingmin Chi, Mufeng Yao, Bo Peng, Jian Li, Yun Cao","doi":"arxiv-2409.08207","DOIUrl":"https://doi.org/arxiv-2409.08207","url":null,"abstract":"Recently, methods like Zero-1-2-3 have focused on single-view based 3D\u0000reconstruction and have achieved remarkable success. However, their predictions\u0000for unseen areas heavily rely on the inductive bias of large-scale pretrained\u0000diffusion models. Although subsequent work, such as DreamComposer, attempts to\u0000make predictions more controllable by incorporating additional views, the\u0000results remain unrealistic due to feature entanglement in the vanilla latent\u0000space, including factors such as lighting, material, and structure. To address\u0000these issues, we introduce the Visual Isotropy 3D Reconstruction Model\u0000(VI3DRM), a diffusion-based sparse views 3D reconstruction model that operates\u0000within an ID consistent and perspective-disentangled 3D latent space. By\u0000facilitating the disentanglement of semantic information, color, material\u0000properties and lighting, VI3DRM is capable of generating highly realistic\u0000images that are indistinguishable from real photographs. By leveraging both\u0000real and synthesized images, our approach enables the accurate construction of\u0000pointmaps, ultimately producing finely textured meshes or point clouds. On the\u0000NVS task, tested on the GSO dataset, VI3DRM significantly outperforms\u0000state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of\u00000.929, and an LPIPS of 0.027. Code will be made available upon publication.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the field of medical microscopic image classification (MIC), CNN-based and Transformer-based models have been extensively studied. However, CNNs struggle to model long-range dependencies, limiting their ability to fully exploit the semantic information in images, while Transformers are hampered by their quadratic computational complexity. To address these challenges, we propose a model based on the Mamba architecture: Microscopic-Mamba. Specifically, we design the Partially Selected Feed-Forward Network (PSFFN) to replace the last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's local feature extraction capabilities. Additionally, we introduce the Modulation Interaction Feature Aggregation (MIFA) module to effectively modulate and dynamically aggregate global and local features. We also incorporate a parallel VSSM mechanism to improve inter-channel information interaction while reducing the number of parameters. Extensive experiments demonstrate that our method achieves state-of-the-art performance on five public datasets. Code is available at https://github.com/zs1314/Microscopic-Mamba.
{"title":"Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters","authors":"Shun Zou, Zhuo Zhang, Yi Zou, Guangwei Gao","doi":"arxiv-2409.07896","DOIUrl":"https://doi.org/arxiv-2409.07896","url":null,"abstract":"In the field of medical microscopic image classification (MIC), CNN-based and\u0000Transformer-based models have been extensively studied. However, CNNs struggle\u0000with modeling long-range dependencies, limiting their ability to fully utilize\u0000semantic information in images. Conversely, Transformers are hampered by the\u0000complexity of quadratic computations. To address these challenges, we propose a\u0000model based on the Mamba architecture: Microscopic-Mamba. Specifically, we\u0000designed the Partially Selected Feed-Forward Network (PSFFN) to replace the\u0000last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's\u0000local feature extraction capabilities. Additionally, we introduced the\u0000Modulation Interaction Feature Aggregation (MIFA) module to effectively\u0000modulate and dynamically aggregate global and local features. We also\u0000incorporated a parallel VSSM mechanism to improve inter-channel information\u0000interaction while reducing the number of parameters. Extensive experiments have\u0000demonstrated that our method achieves state-of-the-art performance on five\u0000public datasets. Code is available at\u0000https://github.com/zs1314/Microscopic-Mamba","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142221557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}