Computer Vision and Image Understanding: Latest Articles

Text to image synthesis with multi-granularity feature aware enhancement Generative Adversarial Networks
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-05-20 | DOI: 10.1016/j.cviu.2024.104042
Pei Dong, Lei Wu, Ruichen Li, Xiangxu Meng, Lei Meng

Synthesizing complex images from text is challenging. Compared to autoregressive and diffusion model-based methods, Generative Adversarial Network-based methods have significant advantages in computational cost and generation efficiency, yet two limitations remain: first, these methods often refine all features output from the previous stage indiscriminately, without considering that these features are initialized gradually during the generation process; second, the sparse semantic constraints provided by the text description are typically ineffective for refining fine-grained features. These issues complicate the balance between generation quality, computational cost, and inference speed. To address them, we propose a Multi-granularity Feature Aware Enhancement GAN (MFAE-GAN), which allows the refinement process to match the order in which features of different granularity are initialized. Specifically, MFAE-GAN (1) samples category-related coarse-grained features and instance-level, detail-related fine-grained features at different generation stages, using different attention mechanisms in Coarse-grained Feature Enhancement (CFE) and Fine-grained Feature Enhancement (FFE), to guide the generation process spatially; (2) provides denser semantic constraints than textual semantic information through Multi-granularity Features Adaptive Batch Normalization (MFA-BN) when refining fine-grained features; and (3) adopts Global Semantics Preservation (GSP) to avoid losing global semantics when sampling features continuously. Extensive experimental results demonstrate that MFAE-GAN is competitive in both image generation quality and efficiency.
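
The abstract does not spell out how MFA-BN works internally, but the general idea of conditional batch normalization can be sketched as follows. This is a minimal single-channel illustration, assuming the scale (`gamma`) and shift (`beta`) are supplied per element by some conditioning signal; the function name and parameters are hypothetical and not taken from the paper.

```python
import math


def conditional_batch_norm(features, gamma, beta, eps=1e-5):
    """Normalize one channel over the batch, then modulate it per element.

    features, gamma, beta: lists of equal length. In a conditional layer,
    gamma and beta are predicted from a conditioning signal instead of
    being fixed learned constants (how they are predicted is not shown).
    """
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    normalized = [(f - mean) / math.sqrt(var + eps) for f in features]
    return [g * n + b for g, n, b in zip(gamma, normalized, beta)]
```

Because `gamma` and `beta` vary per element here, the conditioning can differ across positions, which is one plausible reading of "denser" constraints than a single sentence embedding.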

Citations: 0
Complete contextual information extraction for self-supervised monocular depth estimation
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-05-15 | DOI: 10.1016/j.cviu.2024.104032
Dazheng Zhou, Mingliang Zhang, Xianjie Gao, Youmei Zhang, Bin Li

Self-supervised learning methods are increasingly important for monocular depth estimation because they do not require ground-truth data during training. Although existing methods based on Convolutional Neural Networks (CNNs) have achieved great success in monocular depth estimation, the limited receptive field of CNNs is usually insufficient to model global information effectively, e.g., the relationship between foreground and background or the relationships among objects, which is crucial for accurately capturing scene structure. Recently, Transformer-based studies have attracted significant interest in computer vision. However, due to the lack of a spatial locality bias, they may fail to model local information, e.g., fine-grained details within an image. To tackle these issues, we propose a novel self-supervised learning framework that combines the advantages of CNNs and Transformers to model the complete contextual information for high-quality monocular depth estimation. Specifically, the proposed method mainly includes two branches: the Transformer branch captures global information, while the Convolution branch preserves local information. We also design a rectangle convolution module with a pyramid structure to perceive semi-global information, e.g., thin objects, along the horizontal and vertical directions within an image. Moreover, we propose a shape refinement module that learns the affinity matrix between each pixel and its neighborhood to obtain an accurate geometrical structure of scenes. Extensive experiments on the KITTI, Cityscapes, and Make3D datasets demonstrate that the proposed method achieves competitive results compared with state-of-the-art self-supervised monocular depth estimation methods and shows good cross-dataset generalization ability.
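
To illustrate why a rectangular (strip-shaped) kernel suits thin objects, the sketch below applies a 1×k or k×1 averaging filter to a small grayscale image. It is only an illustration under stated assumptions: uniform weights stand in for learned convolution kernels, zero padding is assumed at the borders, and `strip_filter` is a hypothetical helper, not a function from the paper.

```python
def strip_filter(image, length, horizontal=True):
    """Average each pixel with its neighbours along one axis (zero padding).

    image: 2-D list of numbers; length: odd strip length.
    Uniform weights are a placeholder for learned kernel weights.
    """
    height, width = len(image), len(image[0])
    half = length // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            for k in range(-half, half + 1):
                yy, xx = (y, x + k) if horizontal else (y + k, x)
                if 0 <= yy < height and 0 <= xx < width:
                    total += image[yy][xx]
            out[y][x] = total / (2 * half + 1)
    return out
```

On an image containing a one-pixel-wide vertical line, the vertical strip keeps the line's response at full strength in the interior, while the horizontal strip dilutes it with background.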

Citations: 0
Digital image defogging using joint Retinex theory and independent component analysis
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-05-14 | DOI: 10.1016/j.cviu.2024.104033
Hossein Noori, Mohammad Hossein Gholizadeh, Hossein Khodabakhshi Rafsanjani

Images captured under adverse weather conditions suffer from poor visibility and low contrast. Such images are not suitable for computer vision analysis and similar applications. Therefore, image defogging/dehazing is one of the most intriguing topics. In this paper, a new, fast, and robust defogging/dehazing algorithm is proposed by combining the Retinex theory with independent component analysis, which performs better than existing algorithms. Initially, the foggy image is decomposed into two components: reflectance and luminance. The former is computed using the Retinex theory, while the latter is obtained by decomposing the foggy image into parallel and perpendicular components of air-light. Finally, the defogged image is obtained by applying Koschmieder’s law. Simulation results show that the defogged images are free of halo effects and retain high resolution. The simulation results also confirm the effectiveness of the proposed method when compared to other conventional techniques in terms of NIQE, FADE, SSIM, PSNR, AG, CIEDE2000, r̄, and implementation time. All foggy and defogged results are available in high quality at the following link: https://drive.google.com/file/d/1OStXrfzdnF43gr6PAnBd8BHeThOfj33z/view?usp=drive_link.
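
Koschmieder's law is commonly written as I(x) = J(x)·t(x) + A·(1 − t(x)), where I is the observed foggy intensity, J the scene radiance, t the transmission, and A the air-light. The sketch below only inverts that model, assuming t and A are already known; the function names, the clamp `t_min`, and the sample values are illustrative assumptions and say nothing about how the paper estimates these quantities.

```python
def defog_pixel(observed, transmission, airlight, t_min=0.1):
    """Invert Koschmieder's law I = J*t + A*(1 - t) for one channel value.

    t_min is a hypothetical lower bound on the transmission that keeps the
    division stable in dense fog; it is not a value from the paper.
    """
    t = max(transmission, t_min)
    radiance = (observed - airlight) / t + airlight
    # Clamp to the valid intensity range [0, 1].
    return min(1.0, max(0.0, radiance))


def defog_image(image, transmission_map, airlight, t_min=0.1):
    """Apply the inversion to a 2-D list of intensities in [0, 1]."""
    return [
        [defog_pixel(i, t, airlight, t_min) for i, t in zip(img_row, t_row)]
        for img_row, t_row in zip(image, transmission_map)
    ]
```

For example, a radiance of 0.2 seen through transmission 0.5 with air-light 0.9 is observed as 0.55, and `defog_pixel(0.55, 0.5, 0.9)` recovers approximately 0.2.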

Citations: 0
Take a prior from other tasks for severe blur removal
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-05-10 | DOI: 10.1016/j.cviu.2024.104027
Pei Wang, Yu Zhu, Danna Xue, Qingsen Yan, Jinqiu Sun, Sung-eui Yoon, Yanning Zhang

Recovering clear structures from severely blurry inputs is a huge challenge due to the detail loss and ambiguous semantics. Although segmentation maps can help deblur facial images, their effectiveness is limited in complex natural scenes because they ignore the detailed structures necessary for deblurring. Furthermore, direct segmentation of blurry images may introduce error propagation. To alleviate the semantic confusion and avoid error propagation, we propose utilizing high-level vision tasks, such as classification, to learn a comprehensive prior for severe blur removal. We propose a feature learning strategy based on knowledge distillation, which aims to learn the priors with global contexts and sharp local structures. To integrate the priors effectively, we propose a semantic prior embedding layer with multi-level aggregation and semantic attention. We validate our method on natural image deblurring benchmarks by introducing the priors to various models, including UNet and mainstream deblurring baselines, to demonstrate its effectiveness and generalization ability. The results show that our approach outperforms existing methods on severe blur removal with our plug-and-play semantic priors.
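
A common building block of feature-level knowledge distillation is a loss that pulls a student's features toward a teacher's. The sketch below shows only that generic ingredient as a mean squared error; which network plays teacher or student, and which layers are matched, are not stated in the abstract, so `feature_distillation_loss` and its inputs are hypothetical.

```python
def feature_distillation_loss(student_features, teacher_features):
    """Mean squared error between two equally sized feature vectors."""
    if len(student_features) != len(teacher_features):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_features, teacher_features)]
    return sum(squared) / len(squared)
```

Minimizing this value over the student's parameters encourages it to reproduce the teacher's representation.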

Citations: 0
Other tokens matter: Exploring global and local features of Vision Transformers for Object Re-Identification
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-05-03 | DOI: 10.1016/j.cviu.2024.104030
Yingquan Wang, Pingping Zhang, Dong Wang, Huchuan Lu

Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global–local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of global and local features of ViT and then propose a novel Global–Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have a strong representational ability, and that global and local information can enhance each other. Based on this finding, we propose a Global Aggregation Encoder (GAE) that uses the class tokens of the last few Transformer layers to learn comprehensive global features effectively. Meanwhile, we propose Local Multi-layer Fusion (LMF), which leverages both the global cues from GAE and multi-layer patch tokens to explore discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks. The code is available at https://github.com/AWangYQ/GLTrans.
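
The idea of reusing class tokens from several late layers can be sketched in a few lines. The example below simply averages them, which is a deliberately plain stand-in for the learned Global Aggregation Encoder; the function name, the `last_k` parameter, and the token layout (class token at index 0) are assumptions made for illustration.

```python
def aggregate_class_tokens(layer_tokens, last_k=4):
    """Average the class token (index 0) of the last `last_k` layers.

    layer_tokens: list over layers; each layer is a list of token vectors,
    with the class token first and the patch tokens after it.
    """
    selected = layer_tokens[-last_k:]
    cls_tokens = [layer[0] for layer in selected]
    dim = len(cls_tokens[0])
    return [sum(tok[d] for tok in cls_tokens) / len(cls_tokens) for d in range(dim)]
```

The authors' actual implementation is in the linked repository; this sketch only shows where the multi-layer class tokens would come from.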

Citations: 0
An unsupervised multi-focus image fusion method via dual-channel convolutional network and discriminator
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-05-01 | DOI: 10.1016/j.cviu.2024.104029
Lixing Fang, Xiangxiang Wang, Junli Zhao, Zhenkuan Pan, Hui Li, Yi Li

The challenge in multi-focus image fusion tasks lies in accurately preserving the complementary information from the source images in the fused image. However, existing datasets often lack ground truth images, making it difficult for some full-reference loss functions (such as SSIM) to effectively participate in model training, thereby further affecting the performance of retaining source image details. To address this issue, this paper proposes an unsupervised dual-channel dense convolutional method, DCD, for multi-focus image fusion. DCD designs Patch processing blocks specifically for the fusion task, which segment the source image pairs into equally sized patches and evaluate their information to obtain a reconstructed image and a set of adaptive weight coefficients. The reconstructed image is used as the reference image, enabling unsupervised methods to utilize full-reference loss functions in training and overcoming the challenge of lacking labeled data in the training set. Furthermore, considering that the human visual system (HVS) is more sensitive to brightness than color, DCD trains the dual-channel network using both RGB images and their luminance components. This allows the network to focus more on the brightness information while preserving the color and gradient details of the source images, resulting in fused images that are more compatible with the HVS. The adaptive weight coefficients obtained through the Patch processing blocks are also used to determine the degree of preservation of the brightness information in the source images. Finally, comparative experiments on different datasets also demonstrate the superior performance of DCD in terms of fused image quality compared to other methods.
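
The patch step can be pictured with a small sketch: split two grayscale sources into equal patches, score each patch, and build a reference image plus per-patch weights. Everything specific here is an assumption for illustration: patch variance is used as the "information" score, the BT.601 coefficients are used for luminance, and `build_reference` is a hypothetical helper rather than the paper's Patch processing block.

```python
def luminance(rgb):
    """ITU-R BT.601 luma of one (r, g, b) pixel."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def patch_variance(img, top, left, size):
    """Variance of the size x size patch whose corner is (top, left)."""
    values = [img[y][x] for y in range(top, top + size) for x in range(left, left + size)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_reference(img_a, img_b, size):
    """Pick, per patch, the source with higher variance.

    Returns (reference, weights), where weights[i][j] is the share of
    image A for patch (i, j). Assumes both grayscale images have the same
    shape and that it is divisible by `size`.
    """
    height, width = len(img_a), len(img_a[0])
    reference = [row[:] for row in img_b]
    weights = []
    for top in range(0, height, size):
        weight_row = []
        for left in range(0, width, size):
            var_a = patch_variance(img_a, top, left, size)
            var_b = patch_variance(img_b, top, left, size)
            total = var_a + var_b
            w_a = var_a / total if total > 0 else 0.5
            weight_row.append(w_a)
            if w_a >= 0.5:
                for y in range(top, top + size):
                    reference[y][left:left + size] = img_a[y][left:left + size]
        weights.append(weight_row)
    return reference, weights
```

A reference built this way could then serve as the target of a full-reference loss such as SSIM, which is the role the abstract assigns to the reconstructed image.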

Citations: 0
Lightweight all-focused light field rendering
IF 4.5 | Zone 3, Computer Science | Q1 Computer Science | Pub Date: 2024-04-27 | DOI: 10.1016/j.cviu.2024.104031
Tomáš Chlubna, Tomáš Milet, Pavel Zemčík

This paper proposes a novel real-time method for high-quality view interpolation from a light field. The proposal is a lightweight method that can be used with a consumer GPU, reaching the same or better quality than existing methods in a shorter time and with significantly smaller memory requirements. Light field rendering belongs to the image-based rendering methods, which can produce realistic images without computationally demanding algorithms. The novel view is synthesized from multiple input images of the same scene, captured at different camera positions. Standard rendering techniques, such as rasterization or ray-tracing, are limited in terms of quality, memory footprint, and speed. Light field rendering methods often produce unwanted artifacts resembling ghosting or blur in certain parts of the scene because the scene geometry is unknown. The proposed method estimates the geometry for each pixel as an optimal focusing distance to mitigate the artifacts. The focusing distance determines which pixels from the input images are mixed to produce the final view. State-of-the-art methods use a constant-step pixel matching scan that iterates over a range of focusing distances. The scan searches for the distance with the smallest color dispersion of the contributing pixels, assuming that they belong to the same spot in the scene. The paper proposes an optimal scanning strategy of the focusing range, an improved color dispersion metric, and other minor improvements, such as sampling block size adjustment, out-of-bounds sampling, and filtering. Experimental results show that the proposal uses fewer resources, achieves better visual quality, and is significantly faster than existing light field rendering methods. The proposal is 8× faster than the methods in the same category. It uses only the four closest views from the light field data, which reduces the necessary data transfer. Existing methods often require the full light field grid, which is typically 8 × 8 images large. Additionally, a new 4K light field dataset, containing scenes of various types, was created and published. An optimal novel method for light field acquisition is also proposed and used to create the dataset.
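
The constant-step scan that the abstract attributes to existing methods can be sketched as follows: for each candidate focusing value, sample one pixel per input view, measure how much the sampled colors disagree, and keep the candidate with the smallest disagreement. This is a simplified baseline under stated assumptions (grayscale images, integer pixel shifts, variance as the dispersion metric, border clamping for out-of-bounds reads); the function names and the disparity parameterization are hypothetical, and the paper's improved scanning strategy and metric are not reproduced.

```python
def sample(view, x, y):
    """Read a pixel, clamping coordinates to the image border."""
    yy = min(max(y, 0), len(view) - 1)
    xx = min(max(x, 0), len(view[0]) - 1)
    return view[yy][xx]


def focus_pixel(views, offsets, x, y, disparities):
    """Return (best_disparity, color) for pixel (x, y) of the novel view.

    views: grayscale images as 2-D lists; offsets: per-view (dx, dy) camera
    offsets relative to the novel view, in grid units; disparities: the
    candidate focusing values to scan, here in pixels per grid unit.
    """
    best = None
    for d in disparities:
        colors = [sample(v, x + round(d * dx), y + round(d * dy))
                  for v, (dx, dy) in zip(views, offsets)]
        mean = sum(colors) / len(colors)
        dispersion = sum((c - mean) ** 2 for c in colors) / len(colors)
        if best is None or dispersion < best[0]:
            best = (dispersion, d, mean)
    return best[1], best[2]
```

The scan works only where the views contain enough texture to disambiguate candidates; in flat regions several focusing values yield the same dispersion, which is one reason the choice of metric and filtering matters.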

Citations: 0
Conditioning diffusion models via attributes and semantic masks for face generation
IF 4.5 Zone 3 Computer Science Q1 Computer Science Pub Date: 2024-04-27 DOI: 10.1016/j.cviu.2024.104026
Giuseppe Lisanti, Nico Giambi

Deep generative models have shown impressive results in generating realistic images of faces. GANs managed to generate high-quality, high-fidelity images when conditioned on semantic masks, but they still lack the ability to diversify their output. Diffusion models partially solve this problem and are able to generate diverse samples given the same condition. This paper introduces a novel strategy for enhancing diffusion models through multi-conditioning, harnessing cross-attention mechanisms to utilize multiple feature sets, ultimately enabling the generation of high-quality and controllable images. The proposed method extends previous approaches by introducing conditioning on both attributes and semantic masks, ensuring finer control over the generated face images. In order to improve the training time and the generation quality, the impact of applying perceptual-focused loss weighting in the latent space instead of the pixel space is also investigated. The proposed solution has been evaluated on the CelebA-HQ dataset, and it can generate realistic and diverse samples while allowing for fine-grained control over multiple attributes and semantic regions. Experiments on the DeepFashion dataset have also been performed in order to analyze the capability of the proposed model to generalize to different domains. In addition, an ablation study has been conducted to evaluate the impact of different conditioning strategies on the quality and diversity of the generated images.

Citations: 0
De2Net: Under-display camera image restoration with feature deconvolution and kernel decomposition
IF 4.5 Zone 3 Computer Science Q1 Computer Science Pub Date: 2024-04-25 DOI: 10.1016/j.cviu.2024.104028
Hangyan Zhu, Shaohui Liu, Ming Liu, Zifei Yan, Wangmeng Zuo

While the under-display camera (UDC) system provides an effective solution for notch-free full-screen displays, it inevitably causes severe image quality degradation due to the diffraction phenomenon. Recent methods have achieved decent performance with deep neural networks, yet the characteristic of the point spread function (PSF) is less studied. In this paper, considering the large support and spatial inconsistency of PSF, we propose De2Net for UDC image restoration with feature deconvolution and kernel decomposition. In terms of feature deconvolution, we introduce Wiener deconvolution as a preliminary process, which alleviates feature entanglement caused by the large PSF support. Besides, the deconvolution kernel can be learned from training images, eliminating the tedious PSF-obtaining process. As for kernel decomposition, we observe regular patterns for PSFs at different positions. Thus, with a kernel prediction network (KPN) deployed for handling the spatial inconsistency problem, we improve it from two aspects, i.e., (i) decomposing the predicted kernels into a set of bases and weights, (ii) decomposing kernels into groups with different dilation rates. These modifications largely improve the receptive field under certain memory limits. Extensive experiments on three commonly used UDC datasets show that De2Net outperforms existing methods both quantitatively and qualitatively. Source code and pre-trained models are available at https://github.com/HyZhu39/De2Net.
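
The Wiener deconvolution step mentioned in this abstract can be illustrated with a minimal sketch of the classical formula, X = conj(H) * Y / (|H|^2 + NSR). This is not the De2Net implementation: the paper applies deconvolution to learned features with a learned kernel, whereas the sketch below works on a 1D circular signal with a known kernel. The function names, the `nsr` parameter, and the naive DFT helper are assumptions made for illustration only.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for tiny demos."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_blur(signal, kernel):
    """Circular convolution of a 1D signal with a short blur kernel (a stand-in for a PSF)."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def wiener_deconvolve(blurred, kernel, nsr=1e-3):
    """Classical Wiener filter; nsr is an assumed constant noise-to-signal ratio."""
    n = len(blurred)
    padded = list(kernel) + [0.0] * (n - len(kernel))  # zero-pad kernel to signal length
    H = dft(padded)
    Y = dft(blurred)
    # Regularized inverse filter: nsr keeps the division stable where |H| is small.
    X = [h.conjugate() * y / (abs(h) ** 2 + nsr) for h, y in zip(H, Y)]
    return [v.real for v in dft(X, inverse=True)]


signal = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
kernel = [0.6, 0.3, 0.1]
blurred = circular_blur(signal, kernel)
restored = wiener_deconvolve(blurred, kernel, nsr=1e-6)
```

With a small `nsr` the restored values approach the original signal; a larger `nsr` trades sharpness for robustness to noise.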

Citations: 0
Structure-aware feature stylization for domain generalization
IF 4.5 Zone 3 Computer Science Q1 Computer Science Pub Date: 2024-04-22 DOI: 10.1016/j.cviu.2024.104016
Milad Cheraghalikhani, Mehrdad Noori, David Osowiechi, Gustavo A. Vargas Hakim, Ismail Ben Ayed, Christian Desrosiers

Generalizing to out-of-distribution (OOD) data is a challenging task for existing deep learning approaches. This problem largely comes from the common but often incorrect assumption of statistical learning algorithms that the source and target data come from the same i.i.d. distribution. To tackle the limited variability of domains available during training, as well as domain shifts at test time, numerous approaches for domain generalization have focused on generating samples from new domains. Recent studies on this topic suggest that feature statistics from instances of different domains can be mixed to simulate synthesized images from a novel domain. While this simple idea achieves state-of-the-art results on various domain generalization benchmarks, it ignores structural information which is key to transferring knowledge across different domains. In this paper, we leverage the ability of humans to recognize objects using solely their structural information (prominent region contours) to design a Structural-Aware Feature Stylization method for domain generalization. Our method improves feature stylization based on mixing instance statistics by enforcing structural consistency across the different style-augmented samples. This is achieved via a multi-task learning model which classifies original and augmented images while also reconstructing their edges in a secondary task. The edge reconstruction task helps the network preserve image structure during feature stylization, while also acting as a regularizer for the classification task. Through quantitative comparisons, we verify the effectiveness of our method over existing state-of-the-art methods on PACS, VLCS, OfficeHome, DomainNet and Digits-DG. The implementation is available at this repository.
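
The idea of mixing feature statistics across instances, which this abstract builds on, can be sketched as follows. This is a simplified illustration of the general statistic-mixing technique, not the authors' structure-aware method: it omits the edge reconstruction task entirely. The function names, the nested-list feature layout (one list of activations per channel), and the mixing weight `lam` are hypothetical choices for this sketch.

```python
import statistics


def channel_stats(values, eps=1e-6):
    """Mean and standard deviation of one channel; eps avoids division by zero."""
    mu = statistics.fmean(values)
    sigma = (statistics.pvariance(values, mu) + eps) ** 0.5
    return mu, sigma


def mix_style(feat_a, feat_b, lam):
    """Re-style feat_a with per-channel statistics interpolated between a and b.

    feat_a, feat_b: lists of channels, each channel a list of activations.
    lam: weight in [0, 1]; 1.0 keeps a's statistics, 0.0 adopts b's.
    """
    mixed = []
    for ch_a, ch_b in zip(feat_a, feat_b):
        mu_a, sig_a = channel_stats(ch_a)
        mu_b, sig_b = channel_stats(ch_b)
        mu_mix = lam * mu_a + (1 - lam) * mu_b
        sig_mix = lam * sig_a + (1 - lam) * sig_b
        # Normalize with a's own statistics, then rescale with the mixed ones.
        mixed.append([sig_mix * (v - mu_a) / sig_a + mu_mix for v in ch_a])
    return mixed


a = [[1.0, 2.0, 3.0, 4.0]]
b = [[10.0, 20.0, 30.0, 40.0]]
half = mix_style(a, b, lam=0.5)
```

The normalized activations keep the relative pattern of `a`, while the mean and spread move toward `b`; in this sketch, that pattern is the only content-related information that survives the restyling.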

Citations: 0