
arXiv - CS - Computer Vision and Pattern Recognition: Latest Publications

MagicStyle: Portrait Stylization Based on Reference Image
Pub Date : 2024-09-12 DOI: arxiv-2409.08156
Zhaoli Deng, Kaibin Zhou, Fanyi Wang, Zhenpeng Mi
The development of diffusion models has significantly advanced the research on image stylization, particularly in the area of stylizing a content image based on a given style image, which has attracted many scholars. The main challenge in this reference image stylization task lies in how to maintain the details of the content image while incorporating the color and texture features of the style image. This challenge becomes even more pronounced when the content image is a portrait which has complex textural details. To address this challenge, we propose a diffusion model-based reference image stylization method specifically for portraits, called MagicStyle. MagicStyle consists of two phases: Content and Style DDIM Inversion (CSDI) and Feature Fusion Forward (FFF). The CSDI phase involves a reverse denoising process, where DDIM Inversion is performed separately on the content image and the style image, storing the self-attention query, key and value features of both images during the inversion process. The FFF phase executes forward denoising, harmoniously integrating the texture and color information from the pre-stored feature queries, keys and values into the diffusion generation process based on our well-designed Feature Fusion Attention (FFA). We conducted comprehensive comparative and ablation experiments to validate the effectiveness of our proposed MagicStyle and FFA.
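
The core of the FFF phase is an attention step that injects the pre-stored content and style keys/values into the generation pass. Below is a minimal sketch of one plausible fusion rule, assuming the stored features are concatenated along the token axis before attention; the tensor layout and the function `feature_fusion_attention` are illustrative, not the paper's exact FFA.

```python
import torch

def feature_fusion_attention(q_gen, kv_content, kv_style, scale=None):
    """Minimal sketch of a fused self-attention step.

    q_gen:      (B, N, C) queries from the current denoising pass.
    kv_content: (k, v) tensors of shape (B, N, C) stored during content DDIM inversion.
    kv_style:   (k, v) tensors of shape (B, N, C) stored during style DDIM inversion.
    """
    k_c, v_c = kv_content
    k_s, v_s = kv_style
    # Concatenate the stored content and style keys/values so the generated
    # image can attend to both sources at once (hypothetical fusion rule).
    k = torch.cat([k_c, k_s], dim=1)          # (B, 2N, C)
    v = torch.cat([v_c, v_s], dim=1)
    scale = scale or q_gen.shape[-1] ** -0.5
    attn = torch.softmax(q_gen @ k.transpose(-2, -1) * scale, dim=-1)  # (B, N, 2N)
    return attn @ v                            # (B, N, C)
```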
Citations: 0
High-Frequency Anti-DreamBooth: Robust Defense Against Image Synthesis
Pub Date : 2024-09-12 DOI: arxiv-2409.08167
Takuto Onikubo, Yusuke Matsui
Recently, text-to-image generative models have been misused to createunauthorized malicious images of individuals, posing a growing social problem.Previous solutions, such as Anti-DreamBooth, add adversarial noise to images toprotect them from being used as training data for malicious generation.However, we found that the adversarial noise can be removed by adversarialpurification methods such as DiffPure. Therefore, we propose a new adversarialattack method that adds strong perturbation on the high-frequency areas ofimages to make it more robust to adversarial purification. Our experimentshowed that the adversarial images retained noise even after adversarialpurification, hindering malicious image generation.
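
The key idea is to concentrate the protective perturbation in the high-frequency band so that purification methods such as DiffPure cannot smooth it away. Below is a hedged sketch of band-limiting a perturbation with an FFT mask; the cutoff radius, the helper names, and the absence of the full attack optimization loop are assumptions for illustration only.

```python
import torch

def high_frequency_mask(h, w, radius_frac=0.25, device="cpu"):
    """Mask that keeps only frequencies outside a centered low-pass radius."""
    yy, xx = torch.meshgrid(
        torch.arange(h, device=device), torch.arange(w, device=device), indexing="ij"
    )
    dist = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    return (dist > radius_frac * min(h, w)).float()

def apply_hf_perturbation(image, noise, radius_frac=0.25):
    """Project adversarial noise onto the high-frequency band before adding it.

    image, noise: (B, C, H, W) tensors in [0, 1]; the band limit is an assumption.
    """
    h, w = noise.shape[-2:]
    mask = high_frequency_mask(h, w, radius_frac, noise.device)
    spec = torch.fft.fftshift(torch.fft.fft2(noise), dim=(-2, -1))
    # Zero out the low-frequency band, then return to the spatial domain.
    hf_noise = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return (image + hf_noise).clamp(0.0, 1.0)
```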
Citations: 0
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
Pub Date : 2024-09-12 DOI: arxiv-2409.08240
Yinwei Wu, Xianpan Zhou, Bing Ma, Xuefeng Su, Kai Ma, Xinchao Wang
While Text-to-Image (T2I) diffusion models excel at generating visually appealing images of individual instances, they struggle to accurately position and control the feature generation of multiple instances. The Layout-to-Image (L2I) task was introduced to address the positioning challenges by incorporating bounding boxes as spatial control signals, but it still falls short in generating precise instance features. In response, we propose the Instance Feature Generation (IFG) task, which aims to ensure both positional accuracy and feature fidelity in generated instances. To address the IFG task, we introduce the Instance Feature Adapter (IFAdapter). The IFAdapter enhances feature depiction by incorporating additional appearance tokens and utilizing an Instance Semantic Map to align instance-level features with spatial locations. The IFAdapter guides the diffusion process as a plug-and-play module, making it adaptable to various community models. For evaluation, we contribute an IFG benchmark and develop a verification pipeline to objectively compare models' abilities to generate instances with accurate positioning and features. Experimental results demonstrate that IFAdapter outperforms other models in both quantitative and qualitative evaluations.
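
One way to read the IFAdapter description is as cross-attention from spatial features to per-instance appearance tokens, gated by the Instance Semantic Map so that each location draws features only from the instance covering it. The sketch below assumes that interpretation; the shapes, the gating rule, and `instance_gated_cross_attention` are hypothetical.

```python
import torch

def instance_gated_cross_attention(x, inst_tokens, inst_map, scale=None):
    """Sketch: each spatial feature attends only to the appearance tokens
    of the instance that covers it.

    x:           (B, HW, C) image features.
    inst_tokens: (B, K, T, C) appearance tokens, K instances x T tokens each.
    inst_map:    (B, HW, K) soft instance-semantic map.
    """
    b, hw, c = x.shape
    k, t = inst_tokens.shape[1], inst_tokens.shape[2]
    tokens = inst_tokens.reshape(b, k * t, c)
    scale = scale or c ** -0.5
    logits = x @ tokens.transpose(-2, -1) * scale          # (B, HW, K*T)
    attn = torch.softmax(logits, dim=-1)
    # Gate the attention with the instance map so off-instance tokens are muted,
    # then renormalize so the rows still sum to one.
    gate = inst_map.repeat_interleave(t, dim=-1)            # (B, HW, K*T)
    attn = attn * gate
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return attn @ tokens                                    # (B, HW, C)
```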
Citations: 0
UGAD: Universal Generative AI Detector utilizing Frequency Fingerprints
Pub Date : 2024-09-12 DOI: arxiv-2409.07913
Inzamamul Alam, Muhammad Shahid Muneer, Simon S. Woo
In the wake of a fabricated explosion image at the Pentagon, an ability to discern real images from fake counterparts has never been more critical. Our study introduces a novel multi-modal approach to detect AI-generated images amidst the proliferation of new generation methods such as Diffusion models. Our method, UGAD, encompasses three key detection steps: First, we transform the RGB images into YCbCr channels and apply an Integral Radial Operation to emphasize salient radial features. Secondly, the Spatial Fourier Extraction operation is used for a spatial shift, utilizing a pre-trained deep learning network for optimal feature extraction. Finally, the deep neural network classification stage processes the data through dense layers using softmax for classification. Our approach significantly enhances the accuracy of differentiating between real and AI-generated images, as evidenced by a 12.64% increase in accuracy and 28.43% increase in AUC compared to existing state-of-the-art methods.
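
The abstract does not define the "Integral Radial Operation"; a common way to obtain radial frequency features is a radially averaged log-magnitude spectrum computed after an RGB-to-YCbCr conversion. The sketch below shows only that assumed reading of the first step and omits the Spatial Fourier Extraction and classifier stages.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """img: (H, W, 3) float array in [0, 1]; standard BT.601 conversion."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return np.stack([y, cb, cr], axis=-1)

def radial_spectrum(channel, n_bins=64):
    """Radially averaged log-magnitude spectrum of one channel (assumed reading
    of the paper's 'Integral Radial Operation')."""
    h, w = channel.shape
    mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(channel))))
    yy, xx = np.indices((h, w))
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    bins = np.minimum((dist / dist.max() * n_bins).astype(int), n_bins - 1)
    return np.array([mag[bins == i].mean() for i in range(n_bins)])

def ugad_style_feature(rgb_img, n_bins=64):
    """One radial profile per YCbCr channel, concatenated into a feature vector."""
    ycbcr = rgb_to_ycbcr(rgb_img)
    return np.concatenate([radial_spectrum(ycbcr[..., i], n_bins) for i in range(3)])
```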
Citations: 0
Cross-Attention Based Influence Model for Manual and Nonmanual Sign Language Analysis
Pub Date : 2024-09-12 DOI: arxiv-2409.08162
Lipisha Chaudhary, Fei Xu, Ifeoma Nwogu
Both manual (relating to the use of hands) and non-manual markers (NMM), such as facial expressions or mouthing cues, are important for providing the complete meaning of phrases in American Sign Language (ASL). Efforts have been made in advancing sign language to spoken/written language understanding, but most of these have primarily focused on manual features only. In this work, using advanced neural machine translation methods, we examine and report on the extent to which facial expressions contribute to understanding sign language phrases. We present a sign language translation architecture consisting of two-stream encoders, with one encoder handling the face and the other handling the upper body (with hands). We propose a new parallel cross-attention decoding mechanism that is useful for quantifying the influence of each input modality on the output. The two streams from the encoder are directed simultaneously to different attention stacks in the decoder. Examining the properties of the parallel cross-attention weights allows us to analyze the importance of facial markers compared to body and hand features during a translating task.
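
A minimal sketch of a decoder layer with two parallel cross-attention stacks, one per encoder stream, is shown below; returning the attention weights is what makes the per-modality influence analysis possible. The layer sizes, residual wiring, and class name are assumptions rather than the authors' exact architecture.

```python
import torch.nn as nn

class ParallelCrossAttentionLayer(nn.Module):
    """Decoder layer with separate cross-attention over face and upper-body memories."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.face_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.body_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, tgt, face_mem, body_mem):
        x, _ = self.self_attn(tgt, tgt, tgt)
        tgt = tgt + x
        # Parallel cross-attention: each modality gets its own stack, and the
        # returned weights can be inspected to quantify its influence.
        face_out, face_w = self.face_attn(tgt, face_mem, face_mem)
        body_out, body_w = self.body_attn(tgt, body_mem, body_mem)
        tgt = tgt + face_out + body_out
        return tgt + self.ffn(tgt), face_w, body_w
```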
Citations: 0
Style Based Clustering of Visual Artworks
Pub Date : 2024-09-12 DOI: arxiv-2409.08245
Abhishek Dangeti, Pavan Gajula, Vivek Srivastava, Vikram Jamwal
Clustering artworks based on style has many potential real-world applications like art recommendations, style-based search and retrieval, and the study of artistic style evolution in an artwork corpus. However, clustering artworks based on style is largely an unaddressed problem. A few present methods for clustering artworks principally rely on generic image feature representations derived from deep neural networks and do not specifically deal with the artistic style. In this paper, we introduce and deliberate over the notion of style-based clustering of visual artworks. Our main objective is to explore neural feature representations and architectures that can be used for style-based clustering and observe their impact and effectiveness. We develop different methods and assess their relative efficacy for style-based clustering through qualitative and quantitative analysis by applying them to four artwork corpora and four curated synthetically styled datasets. Our analysis provides some key novel insights on architectures, feature representations, and evaluation methods suitable for style-based clustering.
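
As a concrete baseline for style-based clustering, one classic style representation is the Gram matrix of intermediate CNN features followed by k-means; the sketch below uses VGG-16 features for that purpose. This is only one of the representations such a study could evaluate, and the layer choice is an assumption, not the paper's method.

```python
import torch
import torchvision.models as models
from sklearn.cluster import KMeans

def gram_style_feature(img_batch, layer_idx=21):
    """Gram-matrix style descriptor from an intermediate VGG-16 layer.

    img_batch: (B, 3, H, W), ImageNet-normalized. The layer index is illustrative.
    """
    vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
    with torch.no_grad():
        feat = vgg[: layer_idx + 1](img_batch)            # (B, C, h, w)
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    gram = f @ f.transpose(1, 2) / (c * h * w)             # (B, C, C)
    return gram.reshape(b, -1).cpu().numpy()

def cluster_by_style(img_batch, n_clusters=4):
    """Flatten the Gram features and run k-means to obtain style cluster labels."""
    feats = gram_style_feature(img_batch)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
```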
Citations: 0
Control+Shift: Generating Controllable Distribution Shifts
Pub Date : 2024-09-12 DOI: arxiv-2409.07940
Roy Friedman, Rhea Chowers
We propose a new method for generating realistic datasets with distribution shifts using any decoder-based generative model. Our approach systematically creates datasets with varying intensities of distribution shifts, facilitating a comprehensive analysis of model performance degradation. We then use these generated datasets to evaluate the performance of various commonly used networks and observe a consistent decline in performance with increasing shift intensity, even when the effect is almost perceptually unnoticeable to the human eye. We see this degradation even when using data augmentations. We also find that enlarging the training dataset beyond a certain point has no effect on the robustness and that stronger inductive biases increase robustness.
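
A simple way to realize controllable shifts with a decoder-based generative model is to translate the latent distribution along a fixed direction by an intensity parameter and decode the result. The sketch below assumes that additive parameterization; the paper's exact shift construction may differ.

```python
import torch

def generate_shifted_dataset(decoder, n_samples, latent_dim, shift_direction, alpha):
    """Sample latents from N(0, I), translate them along `shift_direction` by
    intensity `alpha`, and decode them into images.

    `decoder` is any pretrained latent-to-image model; the additive shift is an
    assumption used only to illustrate controllable shift intensity.
    """
    z = torch.randn(n_samples, latent_dim)
    direction = shift_direction / shift_direction.norm()
    z_shifted = z + alpha * direction            # alpha = 0 reproduces the source data
    with torch.no_grad():
        return decoder(z_shifted)                # (n_samples, C, H, W) images

# Usage idea: sweep alpha over [0.0, 0.5, 1.0, 2.0] to build datasets with
# increasing shift intensity, then measure a classifier's accuracy on each.
```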
Citations: 0
Scribble-Guided Diffusion for Training-free Text-to-Image Generation
Pub Date : 2024-09-12 DOI: arxiv-2409.08026
Seonho Lee, Jiho Choi, Seohyun Lim, Jiwook Kim, Hyunjung Shim
Recent advancements in text-to-image diffusion models have demonstrated remarkable success, yet they often struggle to fully capture the user's intent. Existing approaches using textual inputs combined with bounding boxes or region masks fall short in providing precise spatial guidance, often leading to misaligned or unintended object orientation. To address these limitations, we propose Scribble-Guided Diffusion (ScribbleDiff), a training-free approach that utilizes simple user-provided scribbles as visual prompts to guide image generation. However, incorporating scribbles into diffusion models presents challenges due to their sparse and thin nature, making it difficult to ensure accurate orientation alignment. To overcome these challenges, we introduce moment alignment and scribble propagation, which allow for more effective and flexible alignment between generated images and scribble inputs. Experimental results on the PASCAL-Scribble dataset demonstrate significant improvements in spatial control and consistency, showcasing the effectiveness of scribble-based guidance in diffusion models. Our code is available at https://github.com/kaist-cvml-lab/scribble-diffusion.
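
Moment alignment can be read as matching the first moment (centroid) of a token's cross-attention map to the centroid of its scribble during denoising. The loss below sketches that reading; the normalization and the way it would be plugged into the sampler are assumptions, and scribble propagation is not shown.

```python
import torch

def moment_alignment_loss(attn_map, scribble_mask, eps=1e-6):
    """First-moment (centroid) alignment between a cross-attention map and a scribble.

    attn_map, scribble_mask: (H, W) non-negative tensors for one text token.
    """
    h, w = attn_map.shape
    yy, xx = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )

    def centroid(m):
        m = m / (m.sum() + eps)                 # normalize to a spatial distribution
        return torch.stack([(m * yy).sum(), (m * xx).sum()])

    # Penalize the distance between the attention centroid and the scribble centroid,
    # normalized by the image size so the loss is resolution-independent.
    return torch.norm(centroid(attn_map) - centroid(scribble_mask)) / max(h, w)
```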
Citations: 0
VI3DRM: Towards meticulous 3D Reconstruction from Sparse Views via Photo-Realistic Novel View Synthesis
Pub Date : 2024-09-12 DOI: arxiv-2409.08207
Hao Chen, Jiafu Wu, Ying Jin, Jinlong Peng, Xiaofeng Mao, Mingmin Chi, Mufeng Yao, Bo Peng, Jian Li, Yun Cao
Recently, methods like Zero-1-2-3 have focused on single-view based 3D reconstruction and have achieved remarkable success. However, their predictions for unseen areas heavily rely on the inductive bias of large-scale pretrained diffusion models. Although subsequent work, such as DreamComposer, attempts to make predictions more controllable by incorporating additional views, the results remain unrealistic due to feature entanglement in the vanilla latent space, including factors such as lighting, material, and structure. To address these issues, we introduce the Visual Isotropy 3D Reconstruction Model (VI3DRM), a diffusion-based sparse views 3D reconstruction model that operates within an ID consistent and perspective-disentangled 3D latent space. By facilitating the disentanglement of semantic information, color, material properties and lighting, VI3DRM is capable of generating highly realistic images that are indistinguishable from real photographs. By leveraging both real and synthesized images, our approach enables the accurate construction of pointmaps, ultimately producing finely textured meshes or point clouds. On the NVS task, tested on the GSO dataset, VI3DRM significantly outperforms state-of-the-art method DreamComposer, achieving a PSNR of 38.61, an SSIM of 0.929, and an LPIPS of 0.027. Code will be made available upon publication.
Citations: 0
Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters
Pub Date : 2024-09-12 DOI: arxiv-2409.07896
Shun Zou, Zhuo Zhang, Yi Zou, Guangwei Gao
In the field of medical microscopic image classification (MIC), CNN-based and Transformer-based models have been extensively studied. However, CNNs struggle with modeling long-range dependencies, limiting their ability to fully utilize semantic information in images. Conversely, Transformers are hampered by the complexity of quadratic computations. To address these challenges, we propose a model based on the Mamba architecture: Microscopic-Mamba. Specifically, we designed the Partially Selected Feed-Forward Network (PSFFN) to replace the last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's local feature extraction capabilities. Additionally, we introduced the Modulation Interaction Feature Aggregation (MIFA) module to effectively modulate and dynamically aggregate global and local features. We also incorporated a parallel VSSM mechanism to improve inter-channel information interaction while reducing the number of parameters. Extensive experiments have demonstrated that our method achieves state-of-the-art performance on five public datasets. Code is available at https://github.com/zs1314/Microscopic-Mamba
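
The abstract does not detail the PSFFN, so the sketch below shows one plausible form: only a selected fraction of channels passes through a convolutional feed-forward path (for local feature extraction), while the remaining channels are carried through unchanged. The split ratio, the depthwise 3x3 convolution, and the class name are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PSFFN(nn.Module):
    """Hedged sketch of a 'Partially Selected Feed-Forward Network'."""

    def __init__(self, dim, select_ratio=0.5, expand=2):
        super().__init__()
        self.sel = int(dim * select_ratio)       # number of channels sent through the FFN
        hidden = self.sel * expand
        self.ffn = nn.Sequential(
            nn.Conv2d(self.sel, hidden, kernel_size=1),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # local mixing
            nn.GELU(),
            nn.Conv2d(hidden, self.sel, kernel_size=1),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        x_sel, x_keep = x[:, : self.sel], x[:, self.sel:]
        # Process only the selected channels; pass the rest through unchanged.
        return torch.cat([self.ffn(x_sel), x_keep], dim=1)
```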
Citations: 0