Stimulating Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling
Tong Li, Hansen Feng, Lizhi Wang, Lin Zhu, Zhiwei Xiong, Hua Huang
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432812
Image denoising is a fundamental problem in computational photography, where achieving high perceptual quality with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, emerging diffusion models have achieved state-of-the-art performance on various tasks and demonstrate great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. First, input inconsistency hinders the connection between diffusion models and image denoising. Second, content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising. The code is available at https://github.com/Li-Tong-621/DMID.
{"title":"Stimulating Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling.","authors":"Tong Li, Hansen Feng, Lizhi Wang, Lin Zhu, Zhiwei Xiong, Hua Huang","doi":"10.1109/TPAMI.2024.3432812","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432812","url":null,"abstract":"<p><p>Image denoising is a fundamental problem in computational photography, where achieving high perception with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, the emerging diffusion model has achieved state-of-the-art performance in various tasks and demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. For one thing, the input inconsistency hinders the connection between diffusion models and image denoising. For another, the content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising. The code is available at https://github.com/Li-Tong-621/DMID.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DifFace: Blind Face Restoration with Diffused Error Contraction
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432651
Zongsheng Yue, Chen Change Loy
While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations outside their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, whose influences must be stabilized and balanced through laborious hyper-parameter tuning. In this work, we propose a novel method named DifFace that copes with unseen and complex degradations more gracefully, without complicated loss designs. The key to our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model, and then gradually transition from this intermediate state to the HQ target by recursively applying the pre-trained diffusion model. The transition distribution relies only on a restoration backbone trained with an L1 loss on some synthetic data, which favorably avoids the cumbersome training process of existing methods. Moreover, the transition distribution can contract the error of the restoration backbone, making our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model are available at https://github.com/zsyOAOA/DifFace.
{"title":"DifFace: Blind Face Restoration with Diffused Error Contraction.","authors":"Zongsheng Yue, Chen Change Loy","doi":"10.1109/TPAMI.2024.3432651","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432651","url":null,"abstract":"<p><p>While deep learning-based methods for blind face restoration have achieved unprecedented success, they still suffer from two major limitations. First, most of them deteriorate when facing complex degradations out of their training data. Second, these methods require multiple constraints, e.g., fidelity, perceptual, and adversarial losses, which require laborious hyper-parameter tuning to stabilize and balance their influences. In this work, we propose a novel method named DifFace that is capable of coping with unseen and complex degradations more gracefully without complicated loss designs. The key of our method is to establish a posterior distribution from the observed low-quality (LQ) image to its high-quality (HQ) counterpart. In particular, we design a transition distribution from the LQ image to the intermediate state of a pre-trained diffusion model and then gradually transmit from this intermediate state to the HQ target by recursively applying a pre-trained diffusion model. The transition distribution only relies on a restoration backbone that is trained with L<sub>1</sub> loss on some synthetic data, which favorably avoids the cumbersome training process in existing methods. Moreover, the transition distribution can contract the error of the restoration backbone and thus makes our method more robust to unknown degradations. Comprehensive experiments show that DifFace is superior to current state-of-the-art methods, especially in cases with severe degradations. Code and model are available at https://github.com/zsyOAOA/DifFace.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer Programming
Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, Feng Wu
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432716
Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) with human-designed heuristics, machine learning carries the potential to learn more effective ones. However, many existing learning-based methods learn which cuts to prefer while neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) the order of the selected cuts also significantly impacts the efficiency of MILP solvers. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, and (2) a lower-level module that formulates cut selection as a sequence/set-to-sequence learning problem and learns policies that select an ordered subset whose cardinality is determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two real-world problems from Huawei.
{"title":"Learning to Cut via Hierarchical Sequence/Set Model for Efficient Mixed-Integer Programming.","authors":"Jie Wang, Zhihai Wang, Xijun Li, Yufei Kuang, Zhihao Shi, Fangzhou Zhu, Mingxuan Yuan, Jia Zeng, Yongdong Zhang, Feng Wu","doi":"10.1109/TPAMI.2024.3432716","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432716","url":null,"abstract":"<p><p>Cutting planes (cuts) play an important role in solving mixed-integer linear programs (MILPs), which formulate many important real-world applications. Cut selection heavily depends on (P1) which cuts to prefer and (P2) how many cuts to select. Although modern MILP solvers tackle (P1)-(P2) by human-designed heuristics, machine learning carries the potential to learn more effective heuristics. However, many existing learning-based methods learn which cuts to prefer, neglecting the importance of learning how many cuts to select. Moreover, we observe that (P3) what order of selected cuts to prefer significantly impacts the efficiency of MILP solvers as well. To address these challenges, we propose a novel hierarchical sequence/set model (HEM) to learn cut selection policies. Specifically, HEM is a bi-level model: (1) a higher-level module that learns how many cuts to select, (2) and a lower-level module-that formulates the cut selection as a sequence/set to sequence learning problem-to learn policies selecting an ordered subset with the cardinality determined by the higher-level module. To the best of our knowledge, HEM is the first data-driven methodology that well tackles (P1)-(P3) simultaneously. Experiments demonstrate that HEM significantly improves the efficiency of solving MILPs on eleven challenging MILP benchmarks, including two Huawei's real problems.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inductive State-Relabeling Adversarial Active Learning with Heuristic Clique Rescaling
Beichen Zhang, Liang Li, Shuhui Wang, Shaofei Cai, Zheng-Jun Zha, Qi Tian, Qingming Huang
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432099
Active learning (AL) designs label-efficient algorithms by labeling only the most representative samples. It reduces annotation cost and has attracted increasing attention from the community. However, previous AL methods suffer from inadequate annotations and unreliable uncertainty estimation. Moreover, we find that they ignore the intra-diversity of the selected samples, which leads to sampling redundancy. In view of these challenges, we propose an inductive state-relabeling adversarial AL model (ISRA) that consists of a unified representation generator, an inductive state-relabeling discriminator, and a heuristic clique rescaling module. The generator introduces contrastive learning to leverage unlabeled samples for self-supervised training, where mutual information is utilized to improve the representation quality for AL selection. Then, we design an inductive uncertainty indicator that learns a state score from labeled data and relabels unlabeled data with different importance for better discrimination of instructive samples. To solve the problem of sampling redundancy, the heuristic clique rescaling module measures the intra-diversity of candidate samples and recurrently rescales them to select the most informative samples. Experiments conducted on eight datasets and two imbalanced scenarios show that our model outperforms previous state-of-the-art AL methods. As an extension to the cross-modal AL task, we apply ISRA to image captioning, where it also achieves superior performance.
Latent Semantic and Disentangled Attention
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432631
Jen-Tzung Chien, Yu-Han Huang
Sequential learning with transformers has achieved state-of-the-art performance on natural language tasks and many others. The key to this success is multi-head self-attention, which encodes and gathers features from the individual tokens of an input sequence. Mapping or decoding is then performed to produce an output sequence via cross-attention. Such an attention framework has three weaknesses. First, since attention mixes up the features of different tokens in the input and output sequences, redundant information is likely to exist in the sequence representation. Second, the attention-weight patterns of different heads tend to be similar, so the model capacity is bounded. Third, the robustness of the encoder-decoder network against model uncertainty is disregarded. To handle these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention that learns latent disentanglement in multi-head attention, where the redundant features in the transformer are compensated with latent topic information. The attention weights are filtered by a mask that is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. Experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.
{"title":"Latent Semantic and Disentangled Attention.","authors":"Jen-Tzung Chien, Yu-Han Huang","doi":"10.1109/TPAMI.2024.3432631","DOIUrl":"10.1109/TPAMI.2024.3432631","url":null,"abstract":"<p><p>Sequential learning using transformer has achieved state-of-the-art performance in natural language tasks and many others. The key to this success is the multi-head self attention which encodes and gathers the features from individual tokens of an input sequence. The mapping or decoding is performed to produce an output sequence via cross attention. There are threefold weaknesses by using such an attention framework. First, since the attention would mix up the features of different tokens in input and output sequences, it is likely that redundant information exists in sequence data representation. Second, the patterns of attention weights among different heads tend to be similar. The model capacity is bounded. Third, the robustness in an encoder-decoder network against the model uncertainty is disregarded. To handle these weaknesses, this paper presents a Bayesian semantic and disentangled mask attention to learn latent disentanglement in multi-head attention where the redundant features in transformer are compensated with the latent topic information. The attention weights are filtered by a mask which is optimized through semantic clustering. This attention mechanism is implemented according to Bayesian learning for clustered disentanglement. The experiments on machine translation and speech recognition show the merit of Bayesian clustered disentanglement for mask attention.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HDF-Net: Capturing Homogeny Difference Features to Localize the Tampered Image
Ruidong Han, Xiaofeng Wang, Ningning Bai, Yihang Wang, Jianpeng Hou, Jianru Xue
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432551
Modern image editing software enables anyone to alter the content of an image to deceive the public, which can pose a security hazard to personal privacy and public safety. The detection and localization of image tampering has become an urgent issue. We reveal that, after manipulations such as splicing, copy-move, and removal, the tampered region exhibits homogeny differences (changes in the image's metadata organization form and structure) from the real region. Therefore, we propose a novel end-to-end network named HDF-Net to extract these homogeny difference features for precise localization of tampering artifacts. HDF-Net is composed of RGB and SRM dual-stream networks comprising three complementary modules: the suspicious tampering-artifact prominent (STP) module, the fine tampering-artifact salient (FTS) module, and the tampering-artifact edge refined (TER) module. We utilize the fully attentional block (FLA) to enhance the characterization ability of the homogeny difference features extracted by each module and to preserve the specifics of tampering artifacts. These modules are gradually merged according to a "coarse-fine-finer" strategy, which significantly improves localization accuracy and edge refinement. Extensive experiments demonstrate that HDF-Net performs better than state-of-the-art tampering localization models on five benchmarks, achieving satisfactory generalization and robustness. Code can be found at https://github.com/ruidonghan/HDF-Net/.
{"title":"HDF-Net: Capturing Homogeny Difference Features to Localize the Tampered Image.","authors":"Ruidong Han, Xiaofeng Wang, Ningning Bai, Yihang Wang, Jianpeng Hou, Jianru Xue","doi":"10.1109/TPAMI.2024.3432551","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432551","url":null,"abstract":"<p><p>Modern image editing software enables anyone to alter the content of an image to deceive the public, which can pose a security hazard to personal privacy and public safety. The detection and localization of image tampering is becoming an urgent issue to be addressed. We have revealed that the tampered region exhibits homogenous differences (the changes in metadata organization form and organization structure of the image) from the real region after manipulations such as splicing, copy-move, and removal. Therefore, we propose a novel end-to-end network named HDF-Net to extract these homogeny difference features for precise localization of tampering artifacts. The HDF-Net is composed of RGB and SRM dual-stream networks, including three complementary modules, namely the suspicious tampering-artifact prominent (STP) module, the fine tampering-artifact salient (FTS) module, and the tampering-artifact edge refined (TER) module. We utilize the fully attentional block (FLA) to enhance the characterization ability of homogeny difference features extracted by each module and preserve the specifics of tampering artifacts. These modules are gradually merged according to the strategy of \"coarse-fine-finer\", which significantly improves the localization accuracy and edge refinement. Extensive experiments demonstrate that HDF-Net performs better than state-of-the-art tampering localization models on five benchmarks, achieving satisfactory generalization and robustness. Code can be found at https://github.com/ruidonghan/HDF-Net/.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432326
Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang
The challenge of semantic segmentation with scarce pixel-level annotations has motivated many self-supervised works; however, most of them essentially train an image encoder or a segmentation head that produces finer dense representations, and when performing segmentation inference they must resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates from real-time, end-to-end inference practice, but also escalates the problem from segmenting each image to clustering all pixels at once, which results in degraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inference paradigm in which inference is performed end to end. Specifically, based on our observations from probing the dense representations of an image-level self-supervised ViT, i.e., semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with an attention-map constraint to train a tailored Transformer decoder with learnable prototypes, and we utilize adaptive prototypes for per-image segmentation inference. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.
FEditNet++: Few-Shot Editing of Latent Semantics in GAN Spaces with Correlated Attribute Disentanglement
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432529
Ran Yi, Teng Hu, Mengfei Xia, Yizhe Tang, Yong-Jin Liu
Generative Adversarial Networks have achieved significant advancements in generating and editing high-resolution images. However, most methods either require extensive labeled datasets or strong prior knowledge, and it is challenging for them to disentangle correlated attributes with few-shot data. In this paper, we propose FEditNet++, a GAN-based approach to explore latent semantics. It aims to enable attribute editing with limited labeled data and to disentangle the correlated attributes. We propose a layer-wise feature contrastive objective, which takes content consistency into consideration and facilitates the invariance of the unrelated attributes before and after editing. Furthermore, we harness the knowledge from the pretrained discriminative model to prevent overfitting. In particular, to solve the entanglement problem between correlated attributes arising from data and semantic latent correlation, we extend our model to jointly optimize multiple attributes and propose a novel decoupling loss and cross-assessment loss to disentangle them in both latent and image space. We further propose a novel-attribute disentanglement strategy to enable editing of novel attributes with unknown entanglements. Finally, we extend our model to accurately edit fine-grained attributes. Qualitative and quantitative assessments demonstrate that our method outperforms state-of-the-art approaches across various datasets, including CelebA-HQ, RaFD, Danbooru2018 and LSUN Church.
Towards a Flexible Semantic Guided Model for Single Image Enhancement and Restoration
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432308
Yuhui Wu, Guoqing Wang, Shaochong Liu, Yang Yang, Wei Li, Xiongxin Tang, Shuhang Gu, Chongyi Li, Heng Tao Shen
Low-light image enhancement (LLIE) investigates how to improve the brightness of an image captured in illumination-insufficient environments. The majority of existing methods enhance low-light images in a global, uniform manner, without taking into account the semantic information of different regions. Consequently, a network may easily deviate from the original color of local regions. To address this issue, we propose a semantic-aware knowledge-guided framework (SKF) that assists a low-light enhancement model in learning the rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that adaptively integrates semantic priors in the feature representation space, a semantic-guided color histogram loss that preserves the color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures from semantic priors. Our SKF is appealing as a general framework for the LLIE task. We further present a refined framework, SKF++, with two new techniques: (a) an extra convolutional branch for intra-class illumination and color recovery that extracts local information, and (b) an equalization-based histogram transformation for contrast enhancement and high-dynamic-range adjustment. Extensive experiments on various benchmarks of the LLIE task and other image processing tasks show that models equipped with SKF/SKF++ significantly outperform the baselines, and that SKF/SKF++ generalizes well to different models and scenes. Besides, the potential benefits of our method for face detection and semantic segmentation in low-light conditions are discussed. The code and pre-trained models are publicly available at https://github.com/langmanbusi/Semantic-Aware-Low-Light-Image-Enhancement.
{"title":"Towards a Flexible Semantic Guided Model for Single Image Enhancement and Restoration.","authors":"Yuhui Wu, Guoqing Wang, Shaochong Liu, Yang Yang, Wei Li, Xiongxin Tang, Shuhang Gu, Chongyi Li, Heng Tao Shen","doi":"10.1109/TPAMI.2024.3432308","DOIUrl":"10.1109/TPAMI.2024.3432308","url":null,"abstract":"<p><p>Low-light image enhancement (LLIE) investigates how to improve the brightness of an image captured in illumination-insufficient environments. The majority of existing methods enhance low-light images in a global and uniform manner, without taking into account the semantic information of different regions. Consequently, a network may easily deviate from the original color of local regions. To address this issue, we propose a semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that adaptively integrates semantic priors in feature representation space, a semantic-guided color histogram loss that preserves color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures by semantic priors. Our SKF is appealing in acting as a general framework in the LLIE task. We further present a refined framework SKF++ with two new techniques: (a) Extra convolutional branch for intra-class illumination and color recovery through extracting local information and (b) Equalization-based histogram transformation for contrast enhancement and high dynamic range adjustment. Extensive experiments on various benchmarks of LLIE task and other image processing tasks show that models equipped with the SKF/SKF++ significantly outperform the baselines and our SKF/SKF++ generalizes to different models and scenes well. Besides, the potential benefits of our method in face detection and semantic segmentation in low-light conditions are discussed. The code and pre-trained models have been publicly available at https://github.com/langmanbusi/Semantic-Aware-Low-Light-Image-Enhancement.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge
Pub Date: 2024-07-23 DOI: 10.1109/TPAMI.2024.3432552
Yan Huang, Yuming Wang, Yunan Zeng, Junshi Huang, Zhenhua Chai, Liang Wang
Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. In contrast, human brains can readily match images with texts using stored multimodal knowledge. Inspired by this, this paper studies a new scenario, unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we propose a simple yet effective method named Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit specific datasets, we refine it using unpaired images and texts in a self-supervised manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on this knowledge, we represent the parsed words in the texts by prototypical region representations and compute region-word similarity scores. Finally, the scores are aggregated by bidirectional similarity pooling into an image-text similarity score, which can be used directly for unpaired image-text matching. The proposed MACK is complementary to existing models and can easily be extended as a re-ranking method to substantially improve their performance on zero-shot and cross-dataset image-text matching.
{"title":"Unpaired Image-text Matching via Multimodal Aligned Conceptual Knowledge.","authors":"Yan Huang, Yuming Wang, Yunan Zeng, Junshi Huang, Zhenhua Chai, Liang Wang","doi":"10.1109/TPAMI.2024.3432552","DOIUrl":"https://doi.org/10.1109/TPAMI.2024.3432552","url":null,"abstract":"<p><p>Recently, the accuracy of image-text matching has been greatly improved by multimodal pretrained models, all of which use millions or billions of paired images and texts for supervised model learning. Different from them, human brains can well match images with texts using their stored multimodal knowledge. Inspired by that, this paper studies a new scenario as unpaired image-text matching, in which paired images and texts are assumed to be unavailable during model learning. To deal with it, we accordingly propose a simple yet effective method namely Multimodal Aligned Conceptual Knowledge (MACK). First, we collect a set of words and their related image regions from publicly available datasets, and compute prototypical region representations to obtain pretrained general knowledge. To make the obtained knowledge better suit for certain datasets, we refine it using unpaired images and texts in a self-supervised learning manner to obtain fine-tuned domain knowledge. Then, to match given images with texts based on the knowledge, we represent parsed words in the texts by prototypical region representations, and compute region-word similarity scores. At last, the scores are aggregated based on bidirectional similarity pooling into an image-text similarity score, which can be directly used for unpaired image-text matching. The proposed MACK is complementary with existing models, which can be easily extended as a re-ranking method to substantially improve their performance of zero-shot and cross-dataset image-text matching.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141753665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}