With the rise of Extended Reality (XR) technology, there is a growing need for real-time light field reconstruction from sparse view inputs. Existing methods can be classified into offline techniques, which generate high-quality novel views at the cost of long inference/training time, and online methods, which either lack generalizability or produce unsatisfactory results. We observe, however, that the intrinsic sparse manifold of Multi-plane Images (MPI) enables a significant acceleration of light field reconstruction while maintaining rendering quality. Based on this insight, we introduce RealLiFe, a novel light field optimization method that leverages the proposed Hierarchical Sparse Gradient Descent (HSGD) to produce high-quality light fields from sparse input images in real time. Technically, a coarse MPI of the scene is first generated by a 3D CNN and then refined in a few iterations using only the scene-content-aligned sparse MPI gradients. Extensive experiments demonstrate that our method achieves comparable visual quality while being 100x faster on average than state-of-the-art offline methods, and delivers better performance (about 2 dB higher PSNR) than other online approaches.
{"title":"RealLiFe: Real-Time Light Field Reconstruction via Hierarchical Sparse Gradient Descent.","authors":"Yijie Deng,Lei Han,Tianpeng Lin,Lin Li,Jinzhi Zhang,Lu Fang","doi":"10.1109/tpami.2026.3651958","DOIUrl":"https://doi.org/10.1109/tpami.2026.3651958","url":null,"abstract":"With the rise of Extended Reality (XR) technology, there is a growing need for real-time light field reconstruction from sparse view inputs. Existing methods can be classified into offline techniques, which can generate high-quality novel views but at the cost of long inference/training time, and online methods, which either lack generalizability or produce unsatisfactory results. However, we have observed that the intrinsic sparse manifold of Multi-plane Images (MPI) enables a significant acceleration of light field reconstruction while maintaining rendering quality. Based on this insight, we introduce RealLiFe, a novel light field optimization method, which leverages the proposed Hierarchical Sparse Gradient Descent (HSGD) to produce high-quality light fields from sparse input images in real time. Technically, the coarse MPI of a scene is first generated using a 3D CNN, and it is further optimized leveraging only the scene content aligned sparse MPI gradients in a few iterations. 
Extensive experiments demonstrate that our method achieves comparable visual quality while being 100x faster on average than state-of-the-art offline methods and delivers better performance (about 2 dB higher in PSNR) compared to other online approaches.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"31 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146088996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
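An MPI represents a scene as a stack of fronto-parallel RGBA planes, and a novel view is rendered by alpha-compositing the planes back to front. As a hedged, minimal sketch of that compositing step only (the function name `composite_mpi` and the toy data are illustrative, and this does not implement the paper's HSGD optimization):

```python
import numpy as np

def composite_mpi(colors, alphas):
    """Back-to-front alpha compositing of a multi-plane image.

    colors: (D, H, W, 3) RGB planes, ordered farthest to nearest.
    alphas: (D, H, W, 1) per-plane opacities in [0, 1].
    Returns the rendered (H, W, 3) view.
    """
    view = np.zeros(colors.shape[1:], dtype=np.float64)
    for rgb, a in zip(colors, alphas):      # far plane first
        view = rgb * a + view * (1.0 - a)   # "over" operator
    return view

# Toy 2-plane scene: an opaque red far plane behind a
# half-transparent green near plane.
far = np.zeros((1, 1, 3)); far[..., 0] = 1.0
near = np.zeros((1, 1, 3)); near[..., 1] = 1.0
colors = np.stack([far, near])
alphas = np.array([[[[1.0]]], [[[0.5]]]])
view = composite_mpi(colors, alphas)
print(view)  # equal parts red and green: [[[0.5 0.5 0. ]]]
```

Real MPI renderers also warp each plane by a per-plane homography before compositing; the sketch skips that and shows only the blending.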
Secure and high-capacity transmission of secret information is an important task in image hiding research. Existing image hiding methods face critical issues: cover-based methods offer high capacity but introduce image distortion and security risks, whereas secure coverless methods have low capacity. To address these issues, this paper proposes a novel generative coverless multi-image hiding method, GCL-MIH, which achieves both high capacity and high security. GCL-MIH first uses a feature-reverse module to compress multiple secret images into feature vectors, then normalizes them into a single vector that follows a standard normal distribution, and finally feeds this vector into an invertible generative network (Flow-GAN) to generate a face image, enabling coverless multi-image hiding without a predefined cover image. Experimental results demonstrate that GCL-MIH successfully hides up to four images within a single generated face image, achieving a maximum embedding rate of 32 bpp, which far exceeds existing coverless methods. On the COCO test set, the generated stego images are highly realistic (FID 11.98), and the recovered secret images exhibit satisfactory fidelity (average PSNR 33.18 dB and SSIM 0.9412 over the four recovered secret images).
{"title":"GCL-MIH: A Generative-Based Coverless Multi-Image Hiding Method.","authors":"Liang Chen,Xianquan Zhang,Chunqiang Yu,Xinpeng Zhang,Ching-Nung Yang,Zhenjun Tang","doi":"10.1109/tpami.2026.3658731","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658731","url":null,"abstract":"Secure and high-capacity secret information transmission is an important task of the image hiding research. The existing image hiding methods face some critical issues: cover-based methods offer high capacity but introduce image distortion and security risks, whereas secure coverless methods have low capacity. To address these issues, this paper proposes a novel generative-based coverless multi-image hiding method called GCL-MIH, which can achieve high capacity and high security. The GCL-MIH first utilizes a feature reverse module to compress multiple secret images into multiple feature vectors and then normalizes them to generate a vector that conforms to a standard normal distribution, and finally inputs this vector into an invertible generative network (Flow-GAN) to generate a face image, enabling coverless multiple-image hiding without a predefined cover image. Experimental results demonstrate that the GCL-MIH successfully hides up to four images within a single generated face image, achieving a maximum embedding rate of 32 bpp. This capacity far exceeds those of the existing coverless methods. 
On the COCO test set, the generated stego images of the GCL-MIH are highly realistic (FID score: 11.98), and the recovered secret images exhibit satisfactory fidelity (the average PSNR and SSIM of four recovered secret images are 33.18 dB and 0.9412).","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"42 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
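The normalization step above maps the concatenated secret features onto a latent that (approximately) follows a standard normal distribution, so the invertible generator can consume it. A hedged sketch of the simplest such whitening, per-vector standardization (the function `standardize`, the vector sizes, and the random inputs are illustrative assumptions, not the paper's feature-reverse module):

```python
import numpy as np

def standardize(v, eps=1e-8):
    """Rescale a feature vector to zero mean and unit variance --
    a crude proxy for shaping secret features into an
    (approximately) standard-normal latent vector."""
    return (v - v.mean()) / (v.std() + eps)

rng = np.random.default_rng(0)
# Four secret images, each compressed to a 256-d feature vector
# (sizes chosen arbitrarily for the demo).
secrets = [rng.uniform(-3.0, 5.0, 256) for _ in range(4)]
latent = standardize(np.concatenate(secrets))
print(latent.shape, float(latent.mean()), float(latent.std()))
```

Standardization fixes only the first two moments; making the latent genuinely Gaussian, as the paper requires for the Flow-GAN input, needs a stronger transform than this sketch shows.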
Pub Date: 2026-01-29. DOI: 10.1109/tpami.2026.3658949
Keji He, Yan Huang, Ya Jing, Qi Wu, Liang Wang
The Vision-and-Language Navigation (VLN) task involves an agent navigating 3D indoor environments according to provided instructions. Cross-modal alignment is one of the most critical challenges in VLN, as the predicted trajectory must precisely align with the given instruction. This paper addresses cross-modal alignment in VLN from a fine-grained perspective. First, to address the weak cross-modal alignment supervision arising from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, Landmark-RxR, which offers precise, fine-grained supervision for VLN. Second, to comprehensively demonstrate the potential and advantages of the fine-grained data in Landmark-RxR, we examine the core components of the training process that depend on the characteristics of the training data: data augmentation, the training paradigm, reward shaping, and navigation loss design. Leveraging our fine-grained data, we carefully design methods for handling it and introduce a novel evaluation mechanism. Experimental results demonstrate that the fine-grained data effectively improve the agent's cross-modal alignment ability. The Landmark-RxR dataset is available at https://github.com/hekj/Landmark-RxR.
{"title":"Fine-Grained Alignment Supervision Matters in Vision-and-Language Navigation.","authors":"Keji He,Yan Huang,Ya Jing,Qi Wu,Liang Wang","doi":"10.1109/tpami.2026.3658949","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658949","url":null,"abstract":"The Vision-and-Language Navigation (VLN) task involves an agent navigating within 3D indoor environments based on provided instructions. Achieving cross-modal alignment presents one of the most critical challenges in VLN, as the predicted trajectory needs to precisely align with the given instruction. This paper focuses on addressing cross-modal alignment in VLN from a fine-grained perspective. Firstly, to address the issue of weak cross-modal alignment supervision arising from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset called Landmark-RxR. This dataset aims to offer precise, fine-grained supervision for VLN. Secondly, in order to comprehensively demonstrate the potential and advantage of the fine-grained data from Landmark-RxR, we explore the core components of the training process that depend on the characteristics of the training data. These components include data augmentation, training paradigm, reward shaping, and navigation loss design. Leveraging our fine-grained data, we carefully design methods for handling them and introduce a novel evaluation mechanism. The experimental results demonstrate that the fine-grained data can effectively improve the agent's cross-modal alignment ability. 
Access to the Landmark-RxR dataset can be obtained from https://github.com/hekj/Landmark-RxR.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"82 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large language models demonstrate impressive performance on downstream tasks, yet fully fine-tuning all of their parameters consumes extensive resources. To mitigate this, Parameter-Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs), which are critical for transitioning large models from pretrained states to task-specific enhancements in PEFT. We propose a framework that clearly defines these directions and explores their properties and the practical challenges of utilizing them. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during fine-tuning, thereby enhancing model performance on targeted tasks. Additionally, based on our exploration of TSDs, we focus on an important issue in PEFT: the initialization of LoRA. While some works have noted the significance of initialization for LoRA's performance and proposed various strategies, these methods are often empirical and not task-specific. To address this issue, we propose LoRA-Init. Starting from TSDs, we identify the directions that require the most adjustment during fine-tuning for downstream tasks; initializing the LoRA matrices with these directions significantly enhances LoRA's performance. Moreover, combining LoRA-Dash and LoRA-Init yields the final TSD-based version of LoRA, which we refer to as LoRA-TSD. Extensive experiments conclusively demonstrate the effectiveness of these methods, and in-depth analyses further reveal their underlying mechanisms. The code is available at https://github.com/Chongjie-Si/Subspace-Tuning.
{"title":"Task-Specific Directions: Definition, Exploration, and Utilization in Parameter Efficient Fine-Tuning.","authors":"Chongjie Si,Zhiyi Shi,Shifan Zhang,Xiaokang Yang,Hanspeter Pfister,Wei Shen","doi":"10.1109/tpami.2026.3659168","DOIUrl":"https://doi.org/10.1109/tpami.2026.3659168","url":null,"abstract":"Large language models demonstrate impressive performance on downstream tasks, yet requiring extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs)-critical for transitioning large models from pretrained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties, and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks. Additionally, based on our exploration of TSD, we focus on an important issue in PEFT: the initialization of LoRA. While some works have pointed out the significance of initialization for LoRA's performance and proposed various strategies, these methods are often empirical and not task-specific. To address this issue, we propose LoRA-Init. Starting from TSD, we identify the directions that require the most adjustment during fine-tuning for downstream tasks. By initializing the matrices in LoRA with these directions, LoRA-Init significantly enhances LoRA's performance. Moreover, we can combine LoRA-Dash and LoRA-Init to create the final version of LoRA based on TSDs, which we refer to as LoRA-TSD. Extensive experiments have conclusively demonstrated the effectiveness of these methods, and in-depth analyses further reveal the underlying mechanisms of these methods. 
The codes are available athttps://github.com/Chongjie-Si/Subspace-Tuning.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"58 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
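LoRA constrains the fine-tuning update to a low-rank product BA added to a frozen weight W. As a hedged illustration of direction-based initialization (here the directions come from an SVD of the pretrained weight itself; the paper's TSDs are identified from task-specific signals, which this sketch does not model, and the helper name is made up):

```python
import numpy as np

def lora_init_from_directions(W, r):
    """Initialize LoRA factors B (out x r) and A (r x in) from the
    top-r singular directions of a weight matrix, splitting each
    singular value evenly between the two factors."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    B = U[:, :r] * np.sqrt(S[:r])
    A = np.sqrt(S[:r])[:, None] * Vt[:r]
    return B, A

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))   # frozen pretrained weight
B, A = lora_init_from_directions(W, r=2)
delta = B @ A                     # low-rank update: forward pass uses W + delta
print(delta.shape, np.linalg.matrix_rank(delta))  # (8, 6) 2
```

During fine-tuning only B and A receive gradients, so the number of trainable parameters is r*(out + in) instead of out*in.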
Pub Date: 2026-01-29. DOI: 10.1109/tpami.2026.3659463
Qixuan Zheng, Ming Zhang, Hong Yan
Compatibilities between the hyperedges of two hypergraphs can be represented as a sparse tensor to avoid exponentially increasing computational costs in hypergraph matching. Kd-tree-based approximate nearest neighbor (ANN) methods have been widely adopted to obtain the sparse compatibility tensor, but without prior knowledge of the correspondences between a pair of feature point sets they usually need a relatively high density to guarantee accuracy, and for large-scale problems they require exhaustive computation. This work introduces a novel cascaded second- and third-order framework for efficient hypergraph matching, whose core is a CUR decomposition-based sparse compatibility tensor generation method. A rough node assignment is first calculated by a CUR-based pairwise matching process with a low second-order computational cost. Using that intermediate assignment as prior knowledge, a compatibility tensor with higher sparsity and a significantly smaller memory footprint can be calculated for a novel probability relaxation labeling (PRL)-based hypergraph matching algorithm. We use the term "reliability" to describe how the tensor affects matching performance and propose a new measurement, the reliability rate, to quantify the reliability of a sparse compatibility tensor. Experimental results on large-scale synthetic datasets and widely adopted benchmarks demonstrate that the proposed framework outperforms existing methods, producing a compatibility tensor that is more than ten times sparser yet more reliable. The proposed CUR-based tensor generation method can be integrated into existing hypergraph matching algorithms and significantly increases their performance at lower computational cost.
{"title":"A CUR Decomposition-Based Mix-Order Framework for Large-Scale Hypergraph Matching.","authors":"Qixuan Zheng,Ming Zhang,Hong Yan","doi":"10.1109/tpami.2026.3659463","DOIUrl":"https://doi.org/10.1109/tpami.2026.3659463","url":null,"abstract":"Compatibilities between the hyperedges of two hy-pergraphs can be represented as a sparse tensor to avoid expo-nentially increasing computational costs in hypergraph matching. Kd-tree-based approximate nearest neighbor (ANN) methods have been widely adopted to obtain the sparse compatibility tensor and usually need a relatively high density to guarantee greater accuracy without prior knowledge of the correspondences between a pair of feature point sets. For large scale problems, they require exhaustive computations. This work introduces a novel cascaded second and third-order framework for efficient hypergraph matching. Its core is a CUR decomposition-based sparse compatibility tensor generation method. A rough node assignment is calculated first by a CUR-based pairwise matching process that has a lower computational cost in the second order. Using that intermediate assignment as prior knowledge, a compatibility tensor with higher sparsity can be calculated, with a significantly decreased memory footprint by a novel probability relaxation labeling (PRL)-based hypergraph matching algorithm. The term \"reliability\" was used to describe how the tensor affects the matching performance and a new measurement, the reliability rate, was proposed to quantify the reliability of a sparse compatibility tensor. Experiment results on large-scale synthetic datasets, and widely adopted benchmarks, demonstrated that the proposed framework outperformed existing methods, creating a more than ten times sparser, but more reliable, compatibility tensor. 
This proposed CUR-based tensor generation method can be integrated into existing hypergraph matching algorithms and will significantly increase their performance with lower computational costs.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"143 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
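CUR decomposition approximates a matrix M ≈ C U R using actual columns C and rows R of M, which keeps the factors sparse and interpretable when M is sparse. A hedged toy sketch with norm-proportional sampling (the sampling scheme and helper name are illustrative assumptions, not the paper's tensor-generation method):

```python
import numpy as np

def cur_decompose(M, k, rng):
    """Toy CUR: sample k columns and k rows of M with probability
    proportional to their squared norms, then solve the small
    linking matrix U = C^+ M R^+ via pseudoinverses."""
    col_p = (M ** 2).sum(axis=0); col_p /= col_p.sum()
    row_p = (M ** 2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(M.shape[1], size=k, replace=False, p=col_p)
    rows = rng.choice(M.shape[0], size=k, replace=False, p=row_p)
    C, R = M[:, cols], M[rows, :]
    U = np.linalg.pinv(C) @ M @ np.linalg.pinv(R)
    return C, U, R

rng = np.random.default_rng(0)
# An exactly rank-2 matrix: CUR with k = 2 recovers it (almost surely).
M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))
C, U, R = cur_decompose(M, k=2, rng=rng)
err = np.linalg.norm(M - C @ U @ R) / np.linalg.norm(M)
print(C.shape, U.shape, R.shape, err < 1e-8)
```

Because the sampled columns of a rank-2 M almost surely span its column space (and likewise for rows), the reconstruction here is exact up to floating point; on noisy data CUR is only an approximation whose quality depends on the sampling.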
Pub Date: 2026-01-29. DOI: 10.1109/tpami.2026.3658965
Michael Fuest, Pingchuan Ma, Ming Gui, Johannes Schusterbauer, Vincent Tao Hu, Bjorn Ommer
Diffusion models are popular generative modeling methods for various vision tasks and have attracted significant attention. They can be considered a unique instance of self-supervised learning because they do not depend on label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models' essential aspects, including their mathematical foundations, popular denoising network architectures, and guidance methods, and details the approaches connecting diffusion models with representation learning: frameworks that leverage representations learned by pre-trained diffusion models for downstream recognition tasks, and methods that use advances in representation and self-supervised learning to enhance diffusion models. The survey aims to offer a comprehensive taxonomy of the interface between diffusion models and representation learning, identifying key open concerns and promising directions for exploration. GitHub: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.
{"title":"Diffusion Models and Representation Learning: A Survey.","authors":"Michael Fuest,Pingchuan Ma,Ming Gui,Johannes Schusterbauer,Vincent Tao Hu,Bjorn Ommer","doi":"10.1109/tpami.2026.3658965","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658965","url":null,"abstract":"Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models' essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. 
Github link: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"94 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
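Among the mathematical foundations such a survey covers is the DDPM forward-noising process and its noise-prediction objective. A hedged minimal sketch (the linear beta schedule, shapes, and function name are conventional illustrative choices, not specific to this survey):

```python
import numpy as np

def ddpm_forward(x0, t, alphas_cumprod, rng):
    """Sample x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    A denoising network eps_hat(x_t, t) is trained to regress eps,
    i.e. to minimize ||eps_hat(x_t, t) - eps||^2."""
    eps = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, eps

betas = np.linspace(1e-4, 0.02, 1000)     # conventional linear schedule
alphas_cumprod = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))          # stand-in "image"
x_t, eps = ddpm_forward(x0, t=500, alphas_cumprod=alphas_cumprod, rng=rng)
print(x_t.shape, eps.shape)  # (4, 4) (4, 4)
```

The closed-form jump to any timestep t is what makes training efficient: each step draws a random t rather than simulating the whole noising chain.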