By leveraging the blur-noise trade-off, imaging with non-uniform exposures greatly extends the flexibility of image acquisition in harsh environments. However, because conventional cameras cannot perceive intra-frame dynamic information, existing methods cannot be deployed for real-time adaptive shutter control during real-world frame acquisition. To address this challenge, we propose a novel Neuromorphic Shutter Control (NSC) system that avoids motion blur and alleviates instantaneous noise, in which the extremely low latency of events is leveraged to monitor real-time motion and enable scene-adaptive exposure. Furthermore, to stabilize the inconsistent Signal-to-Noise Ratio (SNR) caused by non-uniform exposure times, we propose a self-supervised event-based image denoising network, SEID, which exploits the statistics of image noise and the inter-frame motion information of events to obtain artificial supervision signals for high-quality imaging in real-world scenes. To demonstrate the effectiveness of the proposed NSC, we implement it in hardware as a hybrid-camera imaging prototype, with which we collect a real-world dataset containing well-synchronized frames and events over diverse target scenes and motion patterns. Experiments on synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art approaches.
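The abstract does not specify the exposure-control rule, so the sketch below only illustrates the general idea of closing the shutter adaptively once the events accumulated during an exposure indicate significant motion; the event-count criterion, the threshold values, and the event-stream format are all assumptions for illustration, not the authors' NSC design.

```python
import numpy as np

def adaptive_exposure(event_timestamps_us, max_exposure_us=20000, max_events=1500):
    """Pick an exposure time by watching the event stream during the exposure:
    stop early once the accumulated event count (a proxy for scene motion)
    exceeds a budget, otherwise expose for the full duration. The threshold and
    the use of a plain event count are illustrative assumptions only."""
    events_in_window = event_timestamps_us[event_timestamps_us < max_exposure_us]
    if len(events_in_window) <= max_events:
        return max_exposure_us                      # static scene: long, clean exposure
    return int(events_in_window[max_events - 1])    # fast motion: close shutter early

rng = np.random.default_rng(0)
fast_motion = np.sort(rng.uniform(0, 20000, size=8000))   # dense events (microseconds)
slow_motion = np.sort(rng.uniform(0, 20000, size=300))    # sparse events
print(adaptive_exposure(fast_motion), adaptive_exposure(slow_motion))
```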
{"title":"Non-Uniform Exposure Imaging via Neuromorphic Shutter Control","authors":"Mingyuan Lin;Jian Liu;Chi Zhang;Zibo Zhao;Chu He;Lei Yu","doi":"10.1109/TPAMI.2025.3526280","DOIUrl":"10.1109/TPAMI.2025.3526280","url":null,"abstract":"By leveraging the blur-noise trade-off, imaging with non-uniform exposures largely extends the image acquisition flexibility in harsh environments. However, the limitation of conventional cameras in perceiving intra-frame dynamic information prevents existing methods from being implemented in the real-world frame acquisition for real-time adaptive camera shutter control. To address this challenge, we propose a novel Neuromorphic Shutter Control (NSC) system to avoid motion blur and alleviate instant noise, where the extremely low latency of events is leveraged to monitor the real-time motion and facilitate the scene-adaptive exposure. Furthermore, to stabilize the inconsistent Signal-to-Noise Ratio (SNR) caused by the non-uniform exposure times, we propose an event-based image denoising network within a self-supervised learning paradigm, i.e., SEID, exploring the statistics of image noise and inter-frame motion information of events to obtain artificial supervision signals for high-quality imaging in real-world scenes. To illustrate the effectiveness of the proposed NSC, we implement it in hardware by building a hybrid-camera imaging prototype system, with which we collect a real-world dataset containing well-synchronized frames and events in diverse scenarios with different target scenes and motion patterns. Experiments on the synthetic and real-world datasets demonstrate the superiority of our method over state-of-the-art approaches.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2770-2784"},"PeriodicalIF":0.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-07, DOI: 10.1109/TPAMI.2025.3526790
Jintian Ji;Songhe Feng
Tensorial Multi-view Clustering (TMC), a prominent approach in multi-view clustering, leverages low-rank tensor learning to capture high-order correlations among views and identify a consistent clustering structure. Despite its promising performance, TMC algorithms face three key challenges: 1) a severe computational burden that makes it difficult to handle large-scale datasets; 2) an estimation bias caused by the convex surrogate of the tensor rank; and 3) the lack of an explicit balance between consistency and complementarity. Aware of these issues, we propose Efficient and Scalable Tensorial Multi-View Subspace Clustering (ESTMC), a basic framework for large-scale multi-view clustering. ESTMC integrates anchor representation learning and non-convex low-rank tensor learning with a Generalized Non-convex Tensor Rank (GNTR) into a unified objective function, which improves the efficiency of the existing subspace-based TMC framework. Furthermore, an extended model, ESTMC-C$^{2}$, with the proposed Enhanced Tensor Rank (ETR), Consistent Geometric Regularization (CGR), and Tensorial Exclusive Regularization (TER), balances the learning of consistency and complementarity among views, delivering divisible representations for the clustering task. Efficient iterative optimization algorithms are designed to solve ESTMC and ESTMC-C$^{2}$; they enjoy economical time complexity and exhibit theoretical convergence. Extensive experimental results on various datasets demonstrate the superiority of the proposed algorithms compared to state-of-the-art methods.
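The scalability of anchor representation learning comes from relating each sample to a small set of m anchors instead of all n samples, so an n x m matrix replaces the n x n affinity. A minimal single-view sketch of this idea follows; anchor selection via k-means, the Gaussian similarity, and all sizes are assumptions, and the tensor low-rank machinery of ESTMC is not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

# One view of a multi-view dataset (synthetic): n samples, d features.
rng = np.random.default_rng(0)
n, d, m = 2000, 16, 50                      # m anchors with m << n (assumed sizes)
X = rng.normal(size=(n, d))

# Choose anchors as k-means centroids (a common choice, assumed here).
anchors = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X).cluster_centers_

# Anchor representation: n x m similarities instead of an n x n affinity,
# which is what keeps anchor-based methods scalable to large datasets.
sq_dist = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)
Z = np.exp(-sq_dist / sq_dist.mean())
Z /= Z.sum(axis=1, keepdims=True)           # row-normalized soft assignments

print(Z.shape)                              # (2000, 50): n x m, not n x n
```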
{"title":"Anchors Crash Tensor: Efficient and Scalable Tensorial Multi-View Subspace Clustering","authors":"Jintian Ji;Songhe Feng","doi":"10.1109/TPAMI.2025.3526790","DOIUrl":"10.1109/TPAMI.2025.3526790","url":null,"abstract":"Tensorial Multi-view Clustering (TMC), a prominent approach in multi-view clustering, leverages low-rank tensor learning to capture high-order correlation among views for consistent clustering structure identification. Despite its promising performance, the TMC algorithms face three key challenges: 1). The severe computational burden makes it difficult for TMC methods to handle large-scale datasets. 2). Estimation bias problem caused by the convex surrogate of the tensor rank. 3). Lack of explicit balance of consistency and complementarity. Being aware of these, we propose a basic framework Efficient and Scalable Tensorial Multi-View Subspace Clustering (ESTMC) for large-scale multi-view clustering. ESTMC integrates anchor representation learning and non-convex function-based low-rank tensor learning with a Generalized Non-convex Tensor Rank (GNTR) into a unified objective function, which enhances the efficiency of the existing subspace-based TMC framework. Furthermore, a novel model ESTMC-C<inline-formula><tex-math>$^{2}$</tex-math></inline-formula> with the proposed Enhanced Tensor Rank (ETR), Consistent Geometric Regularization (CGR), and Tensorial Exclusive Regularization (TER) is extended to balance the learning of consistency and complementarity among views, delivering divisible representations for the clustering task. Efficient iterative optimization algorithms are designed to solve the proposed ESTMC and ESTMC-C<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>, which enjoy time-economical complexity and exhibit theoretical convergence. Extensive experimental results on various datasets demonstrate the superiority of the proposed algorithms as compared to state-of-the-art methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2660-2675"},"PeriodicalIF":0.0,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-06, DOI: 10.1109/TPAMI.2025.3526188
Xingxing Wei;Shouwei Ruan;Yinpeng Dong;Hang Su;Xiaochun Cao
Adversarial patches are an important form of adversarial attack in the physical world. To improve the naturalness and aggressiveness of existing adversarial patches, location-aware patches have been proposed, in which the patch's location on the target object is integrated into the optimization process. Although effective, efficiently finding the optimal location for placing a patch is challenging, especially under black-box attack settings. In this paper, we first empirically find that the regions where adversarial patch locations aggregate to produce effective attacks on the same facial image are quite similar across different face recognition models. Based on this observation, we propose a novel framework called Distribution-Optimized Adversarial Patch (DOPatch) to efficiently search for these aggregation regions by modeling them as a distribution. Using the distribution prior, we further design two query-based black-box attack methods, Location Optimization Attack (DOP-LOA) and Distribution Transfer Attack (DOP-DTA), to attack unseen face recognition models. We evaluate the proposed methods on various state-of-the-art face recognition and image recognition models (including popular large models) to demonstrate their effectiveness and generalization. We also conduct extensive ablation studies and analyses to provide insights into the distribution of adversarial locations.
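The idea of modeling effective patch locations as a distribution prior can be illustrated by fitting a simple 2-D Gaussian to locations that produced successful attacks and sampling new candidates from it. DOPatch's actual distribution model and its query-based optimization are not reproduced here; the data, the single-Gaussian choice, and the sample counts are assumptions.

```python
import numpy as np

# Synthetic (x, y) patch locations that produced successful attacks on one
# facial image (made-up data; in the paper such regions are observed to be
# similar across different face recognition models).
rng = np.random.default_rng(0)
successful_locations = rng.normal(loc=[60.0, 40.0], scale=[6.0, 4.0], size=(200, 2))

# Model the location prior as a single 2-D Gaussian (a simplifying assumption;
# the paper learns a richer distribution).
mean = successful_locations.mean(axis=0)
cov = np.cov(successful_locations, rowvar=False)

# Sample candidate locations from the prior when attacking an unseen model,
# rather than searching the whole image for a good placement.
candidates = rng.multivariate_normal(mean, cov, size=10)
print(np.round(candidates, 1))
```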
{"title":"Distributionally Location-Aware Transferable Adversarial Patches for Facial Images","authors":"Xingxing Wei;Shouwei Ruan;Yinpeng Dong;Hang Su;Xiaochun Cao","doi":"10.1109/TPAMI.2025.3526188","DOIUrl":"10.1109/TPAMI.2025.3526188","url":null,"abstract":"Adversarial patch is one of the important forms of performing adversarial attacks in the physical world. To improve the naturalness and aggressiveness of existing adversarial patches, location-aware patches are proposed, where the patch's location on the target object is integrated into the optimization process to perform attacks. Although it is effective, efficiently finding the optimal location for placing the patches is challenging, especially under the black-box attack settings. In this paper, we first empirically find that the aggregation regions of adversarial patch's locations to show effective attacks for the same facial image are pretty similar across different face recognition models. Based on this observation, we then propose a novel framework called Distribution-Optimized Adversarial Patch (DOPatch) to efficiently search for the aggregation regions in a distribution modeling way. Using the distribution prior, we further design two query-based black-box attack methods: Location Optimization Attack (DOP-LOA) and Distribution Transfer Attack (DOP-DTA) to attack unseen face recognition models. We finally evaluate the proposed methods on various SOTA face recognition models and image recognition models (including the popular big models) to demonstrate our effectiveness and generalization. We also conduct extensive ablation studies and analyses to provide insights into the distribution of adversarial locations.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2849-2864"},"PeriodicalIF":0.0,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142934627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Methods for synthesizing 3D-aware face images have developed rapidly thanks to neural radiance fields, achieving high quality and fast inference. However, existing solutions for independently editing facial geometry and appearance usually require retraining and are not optimized for recent generation methods, and thus tend to lag behind the generation process. To address these issues, we introduce NeRFFaceEditing, which enables editing and decoupling geometry and appearance in the pretrained tri-plane-based neural radiance field while retaining its high quality and fast inference speed. Our key idea for disentanglement is to use the statistics of the tri-plane to represent the high-level appearance of its corresponding facial volume. Moreover, we leverage a generated 3D-continuous semantic mask as an intermediary for geometry editing. We devise a geometry decoder (whose output is unchanged when the appearance changes) and an appearance decoder. The geometry decoder aligns the original facial volume with the semantic mask volume. We also enhance the disentanglement by explicitly regularizing rendered images with the same appearance but different geometry to be similar in terms of color distribution for each facial component separately. Our method allows users to edit via semantic masks with decoupled control of geometry and appearance. Both qualitative and quantitative evaluations show the superior geometry and appearance control abilities of our method compared to existing and alternative solutions.
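The core disentanglement idea, representing appearance by the statistics of the tri-plane features, can be illustrated with an AdaIN-style per-channel mean/std transfer: the normalized features keep the spatial (geometric) structure while the imposed statistics carry appearance. The tri-plane tensor shape and the specific use of mean and standard deviation are assumptions for this sketch, not the paper's exact formulation.

```python
import torch

def transfer_plane_statistics(geometry_planes: torch.Tensor,
                              appearance_planes: torch.Tensor,
                              eps: float = 1e-5) -> torch.Tensor:
    """AdaIN-style sketch: keep the spatial structure of `geometry_planes` but
    impose the per-channel mean/std of `appearance_planes`. Inputs are assumed
    to be tri-plane features of shape (3, C, H, W)."""
    g_mean = geometry_planes.mean(dim=(-2, -1), keepdim=True)
    g_std = geometry_planes.std(dim=(-2, -1), keepdim=True) + eps
    a_mean = appearance_planes.mean(dim=(-2, -1), keepdim=True)
    a_std = appearance_planes.std(dim=(-2, -1), keepdim=True) + eps
    normalized = (geometry_planes - g_mean) / g_std      # appearance removed
    return normalized * a_std + a_mean                   # appearance re-imposed

planes_a = torch.randn(3, 32, 64, 64)   # tri-plane of face A (shape assumed)
planes_b = torch.randn(3, 32, 64, 64)   # tri-plane of face B
mixed = transfer_plane_statistics(planes_a, planes_b)
print(mixed.shape)
```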
{"title":"Towards High-Quality and Disentangled Face Editing in a 3D GAN","authors":"Kaiwen Jiang;Shu-Yu Chen;Feng-Lin Liu;Hongbo Fu;Lin Gao","doi":"10.1109/TPAMI.2024.3523422","DOIUrl":"10.1109/TPAMI.2024.3523422","url":null,"abstract":"Recent methods for synthesizing 3D-aware face images have achieved rapid development thanks to neural radiance fields, allowing for high quality and fast inference speed. However, existing solutions for editing facial geometry and appearance independently usually require retraining and are not optimized for the recent work of generation, thus tending to lag behind the generation process. To address these issues, we introduce NeRFFaceEditing, which enables editing and decoupling geometry and appearance in the pretrained tri-plane-based neural radiance field while retaining its high quality and fast inference speed. Our key idea for disentanglement is to use the statistics of the tri-plane to represent the high-level appearance of its corresponding facial volume. Moreover, we leverage a generated 3D-continuous semantic mask as an intermediary for geometry editing. We devise a geometry decoder (whose output is unchanged when the appearance changes) and an appearance decoder. The geometry decoder aligns the original facial volume with the semantic mask volume. We also enhance the disentanglement by explicitly regularizing rendered images with the same appearance but different geometry to be similar in terms of color distribution for each facial component separately. Our method allows users to edit via semantic masks with decoupled control of geometry and appearance. Both qualitative and quantitative evaluations show the superior geometry and appearance control abilities of our method compared to existing and alternative solutions.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2533-2544"},"PeriodicalIF":0.0,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142934569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises $\langle condition, question, answer\rangle$ instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module and a multi-task answer head, which guides nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs from current methods considerably. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins, while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previous challenges.
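The $\langle condition, question, answer\rangle$ triplets can be made concrete with a tiny example for one text image. The abstract only names character frequency and position as example attributes, so the remaining attribute types and the wording of the instructions below are assumptions.

```python
# Illustrative <condition, question, answer> instruction triplets for a text
# image containing the word "street". Attribute wording is assumed; the paper
# gives character frequency and position as example attributes.
label = "street"

triplets = [
    # (condition, question, answer)
    ("", "how many times does 't' occur?", label.count("t")),          # frequency
    ("", "which character is at position 3?", label[2]),               # position
    ("the word starts with 's'", "what is the next character?", label[1]),
    ("", "how long is the word?", len(label)),
]

for condition, question, answer in triplets:
    print(f"condition={condition!r:28} question={question!r:42} answer={answer!r}")
```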
{"title":"Instruction-Guided Scene Text Recognition","authors":"Yongkun Du;Zhineng Chen;Yuchen Su;Caiyan Jia;Yu-Gang Jiang","doi":"10.1109/TPAMI.2025.3525526","DOIUrl":"10.1109/TPAMI.2025.3525526","url":null,"abstract":"Multi-modal models have shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models cannot be trivially applied to scene text recognition (STR) due to the compositional difference between natural and text images. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency, position, etc. IGTR first devises <inline-formula><tex-math>$leftlangle condition,question,answerrightrangle$</tex-math></inline-formula> instruction triplets, providing rich and diverse descriptions of character attributes. To effectively learn these attributes through question-answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module and a multi-task answer head, which guides nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs from current methods considerably. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins, while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of rarely appearing and morphologically similar characters, which were previous challenges.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2723-2738"},"PeriodicalIF":0.0,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03, DOI: 10.1109/TPAMI.2025.3525671
Dong Zhang;Kwang-Ting Cheng
Thanks to recent achievements in task-driven image quality enhancement (IQE) models such as ESTR (Liu et al. 2023), the image enhancement model and the visual recognition model can mutually improve each other's quantitative performance while producing high-quality processed images that are perceivable by human vision systems. However, existing task-driven IQE models tend to overlook an underlying fact: different levels of vision tasks have varying, and sometimes conflicting, requirements on image features. To address this problem, this paper proposes a generalized gradient promotion (GradProm) training strategy for task-driven IQE of medical images. Specifically, we partition a task-driven IQE system into two sub-models, i.e., a mainstream model for image enhancement and an auxiliary model for visual recognition. During training, GradProm updates only the parameters of the image enhancement model, using the gradients of both the visual recognition model and the image enhancement model, but only when these gradients are aligned in the same direction, as measured by their cosine similarity. When the gradients of the two sub-models are not aligned, GradProm uses only the gradient of the image enhancement model to update its parameters. Theoretically, we prove that the optimization direction of the image enhancement model is not biased by the auxiliary visual recognition model under GradProm. Empirically, extensive experimental results on four public yet challenging medical image datasets demonstrate the superior performance of GradProm over existing state-of-the-art methods.
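The gradient-gating rule described above (use the auxiliary recognition gradient only when it agrees in direction with the enhancement gradient, judged by cosine similarity, and update only the enhancement model) can be sketched minimally as follows. The two toy modules, the loss terms, the learning rate, and the flattening of gradients into single vectors are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the two sub-models (assumed for illustration only).
enhance = torch.nn.Conv2d(1, 1, 3, padding=1)   # mainstream image-enhancement model
recog = torch.nn.Linear(32 * 32, 10)            # auxiliary visual-recognition model

x = torch.rand(4, 1, 32, 32)                    # input images (synthetic)
target_img = torch.rand(4, 1, 32, 32)           # enhancement target (hypothetical)
labels = torch.randint(0, 10, (4,))             # recognition labels (hypothetical)

def flat_grad(loss, params):
    """Gradient of `loss` w.r.t. `params`, flattened into one vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

enhanced = enhance(x)
loss_enh = F.mse_loss(enhanced, target_img)                       # enhancement loss
loss_rec = F.cross_entropy(recog(enhanced.flatten(1)), labels)    # recognition loss

params = list(enhance.parameters())
g_enh = flat_grad(loss_enh, params)
g_rec = flat_grad(loss_rec, params)

# GradProm-style gate: add the auxiliary gradient only when it agrees in direction.
if F.cosine_similarity(g_enh, g_rec, dim=0) > 0:
    combined = g_enh + g_rec
else:
    combined = g_enh

# Manual SGD step on the enhancement model only (recognition model untouched).
with torch.no_grad():
    offset = 0
    for p in params:
        n = p.numel()
        p -= 1e-3 * combined[offset:offset + n].view_as(p)
        offset += n
```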
{"title":"Generalized Task-Driven Medical Image Quality Enhancement With Gradient Promotion","authors":"Dong Zhang;Kwang-Ting Cheng","doi":"10.1109/TPAMI.2025.3525671","DOIUrl":"10.1109/TPAMI.2025.3525671","url":null,"abstract":"Thanks to the recent achievements in task-driven image quality enhancement (IQE) models like ESTR (Liu et al. 2023), the image enhancement model and the visual recognition model can mutually enhance each other's quantitation while producing high-quality processed images that are perceivable by our human vision systems. However, existing task-driven IQE models tend to overlook an underlying fact–different levels of vision tasks have varying and sometimes conflicting requirements of image features. To address this problem, this paper proposes a generalized gradient promotion (<italic>GradProm</i>) training strategy for task-driven IQE of medical images. Specifically, we partition a task-driven IQE system into two sub-models, i.e., a mainstream model for image enhancement and an auxiliary model for visual recognition. During training, <italic>GradProm</i> updates only parameters of the image enhancement model using gradients of the visual recognition model and the image enhancement model, but only when gradients of these two sub-models are aligned in the same direction, which is measured by their cosine similarity. In case gradients of these two sub-models are not in the same direction, <italic>GradProm</i> only uses the gradient of the image enhancement model to update its parameters. Theoretically, we have proved that the optimization direction of the image enhancement model will not be biased by the auxiliary visual recognition model under the implementation of <italic>GradProm</i>. Empirically, extensive experimental results on four public yet challenging medical image datasets demonstrated the superior performance of <italic>GradProm</i> over existing state-of-the-art methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2785-2798"},"PeriodicalIF":0.0,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142924714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The deep unfolding network represents a promising research avenue in image restoration. However, most current deep unfolding methods are anchored in first-order optimization algorithms, which suffer from slow convergence and unsatisfactory learning efficiency. In this paper, to address this issue, we first formulate an improved second-order semi-smooth Newton (ISN) algorithm, transforming the original nonlinear equations into an optimization problem amenable to network implementation. We then propose an innovative network architecture based on the ISN algorithm for blind image restoration, namely DeepSN-Net. To the best of our knowledge, DeepSN-Net is the first successful attempt to design a second-order deep unfolding network for image restoration, filling a gap in this area. It offers several distinct advantages: 1) DeepSN-Net provides a unified framework for a variety of image restoration tasks in both synthetic and real-world contexts, without imposing constraints on the degradation conditions. 2) The network architecture is meticulously aligned with the ISN algorithm, ensuring that each module possesses robust physical interpretability. 3) The network exhibits high learning efficiency, superior restoration accuracy, and good generalization ability across 11 datasets on three typical restoration tasks. The success of DeepSN-Net on image restoration may inspire subsequent work centered on second-order optimization algorithms, which would benefit the community.
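Deep unfolding replaces the hand-designed steps of an iterative solver with a fixed number of learned stages. The skeleton below shows that general pattern only; a small residual CNN per stage is an assumed stand-in for the paper's ISN-derived second-order update, and the stage count and module shapes are arbitrary.

```python
import torch
import torch.nn as nn

class UnfoldedRestorer(nn.Module):
    """Generic deep-unfolding skeleton: K stages, each a learned update step.
    DeepSN-Net's ISN-derived update is not reproduced here; each stage below is
    an assumed residual CNN used purely to show the unfolding structure."""

    def __init__(self, stages: int = 5, channels: int = 3):
        super().__init__()
        self.steps = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, channels, 3, padding=1),
            )
            for _ in range(stages)
        )

    def forward(self, degraded: torch.Tensor) -> torch.Tensor:
        x = degraded
        for step in self.steps:           # one loop iteration == one unfolded stage
            x = x + step(x)               # learned update replaces a solver iteration
        return x

restored = UnfoldedRestorer()(torch.rand(1, 3, 64, 64))
print(restored.shape)   # torch.Size([1, 3, 64, 64])
```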
{"title":"DeepSN-Net: Deep Semi-Smooth Newton Driven Network for Blind Image Restoration","authors":"Xin Deng;Chenxiao Zhang;Lai Jiang;Jingyuan Xia;Mai Xu","doi":"10.1109/TPAMI.2024.3525089","DOIUrl":"10.1109/TPAMI.2024.3525089","url":null,"abstract":"The deep unfolding network represents a promising research avenue in image restoration. However, most current deep unfolding methodologies are anchored in first-order optimization algorithms, which suffer from sluggish convergence speed and unsatisfactory learning efficiency. In this paper, to address this issue, we first formulate an improved second-order semi-smooth Newton (ISN) algorithm, transforming the original nonlinear equations into an optimization problem amenable to network implementation. After that, we propose an innovative network architecture based on the ISN algorithm for blind image restoration, namely DeepSN-Net. To the best of our knowledge, DeepSN-Net is the first successful endeavor to design a second-order deep unfolding network for image restoration, which fills the blank of this area. Furthermore, it offers several distinct advantages: 1) DeepSN-Net provides a unified framework to a variety of image restoration tasks in both synthetic and real-world contexts, without imposing constraints on the degradation conditions. 2) The network architecture is meticulously aligned with the ISN algorithm, ensuring that each module possesses robust physical interpretability. 3) The network exhibits high learning efficiency, superior restoration accuracy and good generalization ability across 11 datasets on three typical restoration tasks. The success of DeepSN-Net on image restoration may ignite many subsequent works centered around the second-order optimization algorithms, which is good for the community.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2632-2646"},"PeriodicalIF":0.0,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Retinex model-based methods have been shown to be effective for layer-wise manipulation with well-designed priors in low-light image enhancement (LLIE). However, the hand-crafted priors and conventional optimization algorithms adopted to solve the layer decomposition problem lead to a lack of adaptivity and efficiency. To this end, this paper proposes a Retinex-based deep unfolding network (URetinex-Net++), which unfolds an optimization problem into a learnable network that decomposes a low-light image into reflectance and illumination layers. By formulating the decomposition problem as a model regularized by implicit priors, three learning-based modules are carefully designed, responsible for data-dependent initialization, highly efficient unfolding optimization, and fairly flexible component adjustment, respectively. In particular, the proposed unfolding optimization module, which introduces two networks to adaptively fit implicit priors in a data-driven manner, achieves noise suppression and detail preservation for the decomposed components. URetinex-Net++ is an augmented version of URetinex-Net that introduces a cross-stage fusion block to alleviate the color defects of URetinex-Net. As a result, URetinex-Net++ boosts LLIE performance in both visual quality and quantitative metrics, while introducing only a few additional parameters and little extra runtime. Extensive experiments on real-world low-light images qualitatively and quantitatively demonstrate the effectiveness and superiority of the proposed URetinex-Net++ over state-of-the-art methods.
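The Retinex model underlying this decomposition treats the observed image as the element-wise product of reflectance and illumination, I = R * L. The numpy sketch below shows that model with a deliberately naive, hand-crafted decomposition (a Gaussian-smoothed illumination estimate and a gamma adjustment); it is only an illustration of the layer decomposition, not the learned unfolding modules of URetinex-Net++, and the sigma and gamma values are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Retinex model: observed image I = R * L (element-wise), with reflectance R
# and illumination L. URetinex-Net++ learns this decomposition by unfolding an
# optimization problem; the closed-form steps below are only a naive stand-in.
rng = np.random.default_rng(0)
low_light = rng.uniform(0.01, 0.3, size=(64, 64))    # synthetic low-light image

# Naive illumination estimate: a heavily smoothed version of the image
# (a hand-crafted prior; the paper replaces such priors with learned modules).
illumination = np.clip(gaussian_filter(low_light, sigma=8), 1e-3, 1.0)

# Reflectance follows from the model, then enhancement brightens the illumination.
reflectance = np.clip(low_light / illumination, 0.0, 1.0)
gamma = 0.4                                           # illumination adjustment (assumed)
enhanced = reflectance * illumination ** gamma

print(enhanced.mean() > low_light.mean())             # enhanced image is brighter
```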
{"title":"Interpretable Optimization-Inspired Unfolding Network for Low-Light Image Enhancement","authors":"Wenhui Wu;Jian Weng;Pingping Zhang;Xu Wang;Wenhan Yang;Jianmin Jiang","doi":"10.1109/TPAMI.2024.3524538","DOIUrl":"10.1109/TPAMI.2024.3524538","url":null,"abstract":"Retinex model-based methods have shown to be effective in layer-wise manipulation with well-designed priors for low-light image enhancement (LLIE). However, the hand-crafted priors and conventional optimization algorithm adopted to solve the layer decomposition problem result in the lack of adaptivity and efficiency. To this end, this paper proposes a Retinex-based deep unfolding network (URetinex-Net++), which unfolds an optimization problem into a learnable network to decompose a low-light image into reflectance and illumination layers. By formulating the decomposition problem as an implicit priors regularized model, three learning-based modules are carefully designed, responsible for data-dependent initialization, high-efficient unfolding optimization, and fairly-flexible component adjustment, respectively. Particularly, the proposed unfolding optimization module, introducing two networks to adaptively fit implicit priors in the data-driven manner, can realize noise suppression and details preservation for decomposed components. URetinex-Net++ is a further augmented version of URetinex-Net, which introduces a cross-stage fusion block to alleviate the color defect in URetinex-Net. Therefore, boosted performance on LLIE can be obtained in both visual quality and quantitative metrics, where only a few parameters are introduced and little time is cost. Extensive experiments on real-world low-light images qualitatively and quantitatively demonstrate the effectiveness and superiority of the proposed URetinex-Net++ over state-of-the-art methods.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2545-2562"},"PeriodicalIF":0.0,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142911797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-31, DOI: 10.1109/TPAMI.2024.3524381
Hang Lin;Yifan Peng;Yubo Zhang;Lin Bie;Xibin Zhao;Yue Gao
A large amount of redundancy is present in convolutional neural networks (CNNs). Identifying this redundancy and removing the redundant filters is an effective way to compress a CNN model with minimal reduction in performance. However, most existing redundancy-based pruning methods only consider the distance between pairs of filters, which can only model simple correlations between filters. Moreover, our experimental observations and analysis show that distance-based pruning methods are not suitable for the high-dimensional features in CNN models. To tackle this issue, we propose a new pruning strategy based on high-order spectral clustering. In this approach, we use a hypergraph structure to capture complex correlations among filters and obtain high-order information among filters via hypergraph structure learning. Finally, based on this high-order information, we can cluster the filters more effectively and remove the redundant filters in each cluster. Experiments on various CNN models and datasets demonstrate that our proposed method outperforms recent state-of-the-art works. For example, with ResNet50 we achieve a 57.1% FLOPs reduction with no accuracy drop on ImageNet, the first result to achieve lossless pruning at such a high compression ratio.
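The underlying pruning pattern, clustering a layer's filters by similarity and keeping one representative per cluster, can be sketched with ordinary spectral clustering. The paper's hypergraph-based high-order clustering is replaced here by scikit-learn's SpectralClustering on a pairwise cosine affinity, and the layer shape and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Filters of one convolutional layer, flattened to vectors:
# (out_channels, in_channels * k * k). Shapes are assumed for illustration.
rng = np.random.default_rng(0)
filters = rng.normal(size=(64, 3 * 3 * 3))

# Pairwise cosine similarity between filters, used as the clustering affinity.
norm = filters / np.linalg.norm(filters, axis=1, keepdims=True)
affinity = np.clip(norm @ norm.T, 0.0, None)   # affinity must be non-negative

# Plain spectral clustering on the pairwise affinity stands in for the paper's
# hypergraph-based high-order clustering.
n_clusters = 32                                 # number of filters to keep (assumed)
labels = SpectralClustering(n_clusters=n_clusters,
                            affinity="precomputed",
                            random_state=0).fit_predict(affinity)

# Keep one representative filter per cluster (closest to the cluster mean);
# the other filters in the cluster are treated as redundant and pruned.
keep = []
for c in range(n_clusters):
    members = np.where(labels == c)[0]
    center = filters[members].mean(axis=0)
    keep.append(members[np.argmin(np.linalg.norm(filters[members] - center, axis=1))])

print(sorted(keep))   # indices of the filters retained after pruning
```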
{"title":"Filter Pruning by High-Order Spectral Clustering","authors":"Hang Lin;Yifan Peng;Yubo Zhang;Lin Bie;Xibin Zhao;Yue Gao","doi":"10.1109/TPAMI.2024.3524381","DOIUrl":"10.1109/TPAMI.2024.3524381","url":null,"abstract":"Large amount of redundancy is widely present in convolutional neural networks (CNNs). Identifying the redundancy in the network and removing the redundant filters is an effective way to compress the CNN model size with a minimal reduction in performance. However, most of the existing redundancy-based pruning methods only consider the distance information between two filters, which can only model simple correlations between filters. Moreover, we point out that distance-based pruning methods are not applicable for high-dimensional features in CNN models by our experimental observations and analysis. To tackle this issue, we propose a new pruning strategy based on high-order spectral clustering. In this approach, we use hypergraph structure to construct complex correlations among filters, and obtain high-order information among filters by hypergraph structure learning. Finally, based on the high-order information, we can perform better clustering on the filters and remove the redundant filters in each cluster. Experiments on various CNN models and datasets demonstrate that our proposed method outperforms the recent state-of-the-art works. For example, with ResNet50, we achieve a 57.1% FLOPs reduction with no accuracy drop on ImageNet, which is the first to achieve lossless pruning with such a high compression ratio.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2402-2415"},"PeriodicalIF":0.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142908392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-31, DOI: 10.1109/TPAMI.2024.3519674
Bo Sun;Hao Kang;Li Guan;Haoxiang Li;Philippos Mordohai;Gang Hua
We present a deep learning model, dubbed Glissando-Net, to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. Previous works predominantly focused on either estimating poses (often at the instance level) or reconstructing shapes, but not both. Glissando-Net is composed of two auto-encoders that are jointly trained, one for RGB images and the other for point clouds. We embrace two key design choices in Glissando-Net to achieve a more accurate prediction of the 3D shape and pose of the object given a single RGB image as input. First, we augment the feature maps of the point cloud encoder and decoder with transformed feature maps from the image decoder, enabling effective 2D-3D interaction in both training and prediction. Second, we predict both the 3D shape and pose of the object in the decoder stage. This way, we better utilize the information in the 3D point clouds, presented only in the training stage, to train the network for more accurate prediction. We jointly train the two encoder-decoders for RGB and point cloud data to learn how to pass latent features to the point cloud decoder during inference. In testing, the encoder of the 3D point cloud is discarded. The design of Glissando-Net is inspired by codeSLAM. Unlike codeSLAM, which targets 3D reconstruction of scenes, we focus on pose estimation and shape reconstruction of objects, and directly predict the object pose and a pose-invariant 3D reconstruction without the need for a code optimization step. Extensive experiments, involving both ablation studies and comparisons with competing methods, demonstrate the efficacy of our proposed method, which compares favorably with the state of the art.
{"title":"Glissando-Net: Deep Single View Category Level Pose Estimation and 3D Reconstruction","authors":"Bo Sun;Hao Kang;Li Guan;Haoxiang Li;Philippos Mordohai;Gang Hua","doi":"10.1109/TPAMI.2024.3519674","DOIUrl":"10.1109/TPAMI.2024.3519674","url":null,"abstract":"We present a deep learning model, dubbed Glissando-Net, to simultaneously estimate the pose and reconstruct the 3D shape of objects at the category level from a single RGB image. Previous works predominantly focused on either estimating poses (often at the instance level), or reconstructing shapes, but not both. Glissando-Net is composed of two auto-encoders that are jointly trained, one for RGB images and the other for point clouds. We embrace two key design choices in Glissando-Net to achieve a more accurate prediction of the 3D shape and pose of the object given a single RGB image as input. First, we augment the feature maps of the point cloud encoder and decoder with transformed feature maps from the image decoder, enabling effective 2D-3D interaction in both training and prediction. Second, we predict both the 3D shape and pose of the object in the decoder stage. This way, we better utilize the information in the 3D point clouds presented only in the training stage to train the network for more accurate prediction. We jointly train the two encoder-decoders for RGB and point cloud data to learn how to pass latent features to the point cloud decoder during inference. In testing, the encoder of the 3D point cloud is discarded. The design of Glissando-Net is inspired by codeSLAM. Unlike codeSLAM, which targets 3D reconstruction of scenes, we focus on pose estimation and shape reconstruction of objects, and directly predict the object pose and a pose invariant 3D reconstruction without the need of the code optimization step. Extensive experiments, involving both ablation studies and comparison with competing methods, demonstrate the efficacy of our proposed method, and compare favorably with the state-of-the-art.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2298-2312"},"PeriodicalIF":0.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142908424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}