RealLiFe: Real-Time Light Field Reconstruction via Hierarchical Sparse Gradient Descent.
Yijie Deng, Lei Han, Tianpeng Lin, Lin Li, Jinzhi Zhang, Lu Fang
Pub Date: 2026-01-30, DOI: 10.1109/tpami.2026.3651958
With the rise of Extended Reality (XR) technology, there is a growing need for real-time light field reconstruction from sparse view inputs. Existing methods can be classified into offline techniques, which generate high-quality novel views but at the cost of long inference/training times, and online methods, which either lack generalizability or produce unsatisfactory results. We observe, however, that the intrinsic sparse manifold of Multi-plane Images (MPI) enables a significant acceleration of light field reconstruction while maintaining rendering quality. Based on this insight, we introduce RealLiFe, a novel light field optimization method that leverages the proposed Hierarchical Sparse Gradient Descent (HSGD) to produce high-quality light fields from sparse input images in real time. Technically, a coarse MPI of the scene is first generated with a 3D CNN and is then refined in a few iterations using only scene-content-aligned sparse MPI gradients. Extensive experiments demonstrate that our method achieves comparable visual quality while being 100x faster on average than state-of-the-art offline methods, and delivers better performance (about 2 dB higher PSNR) than other online approaches.
GCL-MIH: A Generative-Based Coverless Multi-Image Hiding Method.
Liang Chen, Xianquan Zhang, Chunqiang Yu, Xinpeng Zhang, Ching-Nung Yang, Zhenjun Tang
Pub Date: 2026-01-29, DOI: 10.1109/tpami.2026.3658731
Secure and high-capacity transmission of secret information is an important goal of image hiding research. Existing image hiding methods face critical issues: cover-based methods offer high capacity but introduce image distortion and security risks, whereas secure coverless methods have low capacity. To address these issues, this paper proposes a novel generative coverless multi-image hiding method, GCL-MIH, that achieves both high capacity and high security. GCL-MIH first uses a feature reverse module to compress multiple secret images into feature vectors, then normalizes them into a single vector that follows a standard normal distribution, and finally feeds this vector into an invertible generative network (Flow-GAN) to generate a face image, enabling coverless multi-image hiding without a predefined cover image. Experimental results demonstrate that GCL-MIH hides up to four images within a single generated face image, achieving a maximum embedding rate of 32 bpp, a capacity that far exceeds that of existing coverless methods. On the COCO test set, the stego images generated by GCL-MIH are highly realistic (FID score: 11.98), and the recovered secret images exhibit satisfactory fidelity (average PSNR of 33.18 dB and SSIM of 0.9412 over the four recovered secret images).
Fine-Grained Alignment Supervision Matters in Vision-and-Language Navigation.
Keji He, Yan Huang, Ya Jing, Qi Wu, Liang Wang
Pub Date: 2026-01-29, DOI: 10.1109/tpami.2026.3658949
The Vision-and-Language Navigation (VLN) task involves an agent navigating within 3D indoor environments based on provided instructions. Achieving cross-modal alignment is one of the most critical challenges in VLN, as the predicted trajectory needs to precisely align with the given instruction. This paper addresses cross-modal alignment in VLN from a fine-grained perspective. Firstly, to address the weak cross-modal alignment supervision arising from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset called Landmark-RxR, which offers precise, fine-grained supervision for VLN. Secondly, to comprehensively demonstrate the potential and advantages of the fine-grained data in Landmark-RxR, we explore the core components of the training process that depend on the characteristics of the training data: data augmentation, training paradigm, reward shaping, and navigation loss design. We tailor each of these components to our fine-grained data and introduce a novel evaluation mechanism. The experimental results demonstrate that the fine-grained data effectively improve the agent's cross-modal alignment ability. The Landmark-RxR dataset is available at https://github.com/hekj/Landmark-RxR.
{"title":"Fine-Grained Alignment Supervision Matters in Vision-and-Language Navigation.","authors":"Keji He,Yan Huang,Ya Jing,Qi Wu,Liang Wang","doi":"10.1109/tpami.2026.3658949","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658949","url":null,"abstract":"The Vision-and-Language Navigation (VLN) task involves an agent navigating within 3D indoor environments based on provided instructions. Achieving cross-modal alignment presents one of the most critical challenges in VLN, as the predicted trajectory needs to precisely align with the given instruction. This paper focuses on addressing cross-modal alignment in VLN from a fine-grained perspective. Firstly, to address the issue of weak cross-modal alignment supervision arising from coarse-grained data, we introduce a human-annotated fine-grained VLN dataset called Landmark-RxR. This dataset aims to offer precise, fine-grained supervision for VLN. Secondly, in order to comprehensively demonstrate the potential and advantage of the fine-grained data from Landmark-RxR, we explore the core components of the training process that depend on the characteristics of the training data. These components include data augmentation, training paradigm, reward shaping, and navigation loss design. Leveraging our fine-grained data, we carefully design methods for handling them and introduce a novel evaluation mechanism. The experimental results demonstrate that the fine-grained data can effectively improve the agent's cross-modal alignment ability. Access to the Landmark-RxR dataset can be obtained from https://github.com/hekj/Landmark-RxR.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"82 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Task-Specific Directions: Definition, Exploration, and Utilization in Parameter Efficient Fine-Tuning.
Chongjie Si, Zhiyi Shi, Shifan Zhang, Xiaokang Yang, Hanspeter Pfister, Wei Shen
Pub Date: 2026-01-29, DOI: 10.1109/tpami.2026.3659168
Large language models demonstrate impressive performance on downstream tasks, yet fully fine-tuning all of their parameters requires extensive resources. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs), which are critical for transitioning large models from their pretrained states to task-specific enhancements in PEFT. We propose a framework that clearly defines these directions and explores their properties and the practical challenges of utilizing them. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during fine-tuning, thereby enhancing model performance on targeted tasks. Additionally, based on our exploration of TSDs, we focus on an important issue in PEFT: the initialization of LoRA. While some works have pointed out the significance of initialization for LoRA's performance and proposed various strategies, these methods are often empirical and not task-specific. To address this issue, we propose LoRA-Init. Starting from TSDs, we identify the directions that require the most adjustment during fine-tuning for downstream tasks. By initializing the matrices in LoRA with these directions, LoRA-Init significantly enhances LoRA's performance. Moreover, LoRA-Dash and LoRA-Init can be combined into the final TSD-based version of LoRA, which we refer to as LoRA-TSD. Extensive experiments conclusively demonstrate the effectiveness of these methods, and in-depth analyses further reveal their underlying mechanisms. The code is available at https://github.com/Chongjie-Si/Subspace-Tuning.
A CUR Decomposition-Based Mix-Order Framework for Large-Scale Hypergraph Matching.
Qixuan Zheng, Ming Zhang, Hong Yan
Pub Date: 2026-01-29, DOI: 10.1109/tpami.2026.3659463
Compatibilities between the hyperedges of two hypergraphs can be represented as a sparse tensor to avoid exponentially increasing computational costs in hypergraph matching. Kd-tree-based approximate nearest neighbor (ANN) methods have been widely adopted to obtain the sparse compatibility tensor, and they usually need a relatively high density to guarantee accuracy when no prior knowledge of the correspondences between a pair of feature point sets is available. For large-scale problems, they require exhaustive computation. This work introduces a novel cascaded second- and third-order framework for efficient hypergraph matching. Its core is a CUR decomposition-based sparse compatibility tensor generation method. A rough node assignment is first calculated by a CUR-based pairwise matching process with a lower, second-order computational cost. Using that intermediate assignment as prior knowledge, a compatibility tensor with higher sparsity can be calculated, significantly decreasing the memory footprint of a novel probability relaxation labeling (PRL)-based hypergraph matching algorithm. The term "reliability" is used to describe how the tensor affects matching performance, and a new measure, the reliability rate, is proposed to quantify the reliability of a sparse compatibility tensor. Experimental results on large-scale synthetic datasets and widely adopted benchmarks demonstrate that the proposed framework outperforms existing methods, producing a compatibility tensor that is more than ten times sparser yet more reliable. The proposed CUR-based tensor generation method can be integrated into existing hypergraph matching algorithms and significantly improves their performance at a lower computational cost.
{"title":"A CUR Decomposition-Based Mix-Order Framework for Large-Scale Hypergraph Matching.","authors":"Qixuan Zheng,Ming Zhang,Hong Yan","doi":"10.1109/tpami.2026.3659463","DOIUrl":"https://doi.org/10.1109/tpami.2026.3659463","url":null,"abstract":"Compatibilities between the hyperedges of two hy-pergraphs can be represented as a sparse tensor to avoid expo-nentially increasing computational costs in hypergraph matching. Kd-tree-based approximate nearest neighbor (ANN) methods have been widely adopted to obtain the sparse compatibility tensor and usually need a relatively high density to guarantee greater accuracy without prior knowledge of the correspondences between a pair of feature point sets. For large scale problems, they require exhaustive computations. This work introduces a novel cascaded second and third-order framework for efficient hypergraph matching. Its core is a CUR decomposition-based sparse compatibility tensor generation method. A rough node assignment is calculated first by a CUR-based pairwise matching process that has a lower computational cost in the second order. Using that intermediate assignment as prior knowledge, a compatibility tensor with higher sparsity can be calculated, with a significantly decreased memory footprint by a novel probability relaxation labeling (PRL)-based hypergraph matching algorithm. The term \"reliability\" was used to describe how the tensor affects the matching performance and a new measurement, the reliability rate, was proposed to quantify the reliability of a sparse compatibility tensor. Experiment results on large-scale synthetic datasets, and widely adopted benchmarks, demonstrated that the proposed framework outperformed existing methods, creating a more than ten times sparser, but more reliable, compatibility tensor. This proposed CUR-based tensor generation method can be integrated into existing hypergraph matching algorithms and will significantly increase their performance with lower computational costs.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"143 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diffusion Models and Representation Learning: A Survey.
Michael Fuest, Pingchuan Ma, Ming Gui, Johannes Schusterbauer, Vincent Tao Hu, Bjorn Ommer
Pub Date: 2026-01-29, DOI: 10.1109/tpami.2026.3658965
Diffusion models are popular generative modeling methods for various vision tasks and have attracted significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models' essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed, including frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive taxonomy of the connections between diffusion models and representation learning, identifying key open challenges and directions for further exploration. GitHub link: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.
{"title":"Diffusion Models and Representation Learning: A Survey.","authors":"Michael Fuest,Pingchuan Ma,Ming Gui,Johannes Schusterbauer,Vincent Tao Hu,Bjorn Ommer","doi":"10.1109/tpami.2026.3658965","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658965","url":null,"abstract":"Diffusion Models are popular generative modeling methods in various vision tasks, attracting significant attention. They can be considered a unique instance of self-supervised learning methods due to their independence from label annotation. This survey explores the interplay between diffusion models and representation learning. It provides an overview of diffusion models' essential aspects, including mathematical foundations, popular denoising network architectures, and guidance methods. Various approaches related to diffusion models and representation learning are detailed. These include frameworks that leverage representations learned from pre-trained diffusion models for subsequent recognition tasks and methods that utilize advancements in representation and self-supervised learning to enhance diffusion models. This survey aims to offer a comprehensive overview of the taxonomy between diffusion models and representation learning, identifying key areas of existing concerns and potential exploration. Github link: https://github.com/dongzhuoyao/Diffusion-Representation-Learning-Survey-Taxonomy.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"94 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZUMA: Training-free Zero-shot Unified Multimodal Anomaly Detection.
Yunfeng Ma, Min Liu, Shuai Jiang, Jingyu Zhou, Yuan Bian, Xueping Wang, Yaonan Wang
Pub Date: 2026-01-29, DOI: 10.1109/tpami.2026.3658856
Multimodal anomaly detection (MAD) aims to exploit both texture and spatial attributes to identify deviations from normal patterns in complex scenarios. However, zero-shot (ZS) settings arising from privacy concerns or confidentiality constraints present significant challenges to existing MAD methods. To address this issue, we introduce ZUMA, a training-free, Zero-shot Unified Multimodal Anomaly detection framework that unleashes CLIP's cross-modal potential to perform ZS MAD. To mitigate the domain gap between CLIP's pretraining space and point clouds, we propose cross-domain calibration (CDC), which efficiently bridges the manifold misalignment through source-domain semantic transfer and establishes a hybrid semantic space, enabling a joint embedding of 2D and 3D representations. Subsequently, ZUMA performs dynamic semantic interaction (DSI) to enable structural decoupling of anomaly regions in the high-dimensional embedding space constructed by CDC, where natural-language prompts serve as semantic anchors that help DSI establish discriminative hyperplanes within hybrid modality representations. Within this framework, ZUMA enables plug-and-play detection of 2D, 3D, or multimodal anomalies without training or fine-tuning, even in cross-dataset or incomplete-modality scenarios. Additionally, to further investigate the potential of the training-free ZUMA within the training-based paradigm, we develop ZUMA-FT, a fine-tuned variant that achieves notable improvements with minimal parameter trade-off. Extensive experiments are conducted on two MAD benchmarks, MVTec 3D-AD and Eyecandies. Notably, the training-free ZUMA achieves state-of-the-art (SOTA) performance on both datasets, outperforming existing ZS MAD methods, including training-based approaches. Moreover, ZUMA-FT further extends the performance boundary of ZUMA with only 6.75M learnable parameters. Code is available at: https://github.com/yif-ma/ZUMA.
{"title":"ZUMA: Training-free Zero-shot Unified Multimodal Anomaly Detection.","authors":"Yunfeng Ma,Min Liu,Shuai Jiang,Jingyu Zhou,Yuan Bian,Xueping Wang,Yaonan Wang","doi":"10.1109/tpami.2026.3658856","DOIUrl":"https://doi.org/10.1109/tpami.2026.3658856","url":null,"abstract":"Multimodal anomaly detection (MAD) aims to exploit both texture and spatial attributes to identify deviations from normal patterns in complex scenarios. However, zero-shot (ZS) settings arising from privacy concerns or confidentiality constraints present significant challenges to existing MAD methods. To address this issue, we introduce ZUMA, a training-free, Zero-shot Unified Multimodal Anomaly detection framework that unleashes CLIP's cross-modal potential to perform ZS MAD. To mitigate the domain gap between CLIP's pretraining space and point clouds, we propose cross-domain calibration (CDC), which efficiently bridges the manifold misalignment through source-domain semantic transfer and establishes a hybrid semantic space, enabling a joint embedding of 2D and 3D representations. Subsequently, ZUMA performs dynamic semantic interaction (DSI) to enable structural decoupling of anomaly regions in the high-dimensional embedding space constructed by CDC, where natural languages serve as semantic anchors to help DSI establish discriminative hyperplanes within hybrid modality representations. Within this framework, ZUMA enables plug-and-play detection of 2D, 3D or multimodal anomalies, without training or fine-tuning even for cross-dataset or incomplete-modality scenarios. Additionally, to further investigate the potential of the training-free ZUMA within the training-based paradigm, we develop ZUMA-FT, a fine-tuned variant that achieves notable improvements with minimal parameter trade-off. Extensive experiments are conducted on two MAD benchmarks, MVTec 3D-AD and Eyecandies. Notably, the training-free ZUMA achieves state-of-the-art (SOTA) performance on both datasets, outperforming existing ZS MAD methods, including training-based approaches. Moreover, ZUMA-FT further extends the performance boundary of ZUMA with only 6.75 M learnable parameters. Code is available at: https://github.com/yif-ma/ZUMA.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"61 1","pages":""},"PeriodicalIF":23.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ATRNet-STAR: A Large Dataset and Benchmark Towards Remote Sensing Object Recognition in the Wild.
Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, Xiang Li
Pub Date: 2026-01-28, DOI: 10.1109/tpami.2026.3658649
The absence of publicly available, large-scale, high-quality datasets for Synthetic Aperture Radar Automatic Target Recognition (SAR ATR) has significantly hindered the application of rapidly advancing deep learning techniques, which hold huge potential to unlock new capabilities in this field. This is primarily because collecting large volumes of diverse target samples from SAR images is prohibitively expensive, owing to privacy concerns, the characteristics of microwave radar imagery perception, and the need for specialized expertise in data annotation. Throughout the history of SAR ATR research, there have been only a handful of small datasets, mainly covering targets such as ships, airplanes, and buildings. The only vehicle dataset, MSTAR, was collected in the 1990s and has long been a valuable source for SAR ATR. To fill this gap, this paper introduces a new large-scale dataset named ATRNet-STAR, with 40 different vehicle categories collected under various realistic imaging conditions and scenes. It marks a substantial advance in dataset scale and diversity, comprising over 190,000 well-annotated samples, $10\times$ larger than its predecessor, the famous MSTAR. Building such a large dataset is a challenging task, and we first detail the data collection scheme. Secondly, we illustrate the value of ATRNet-STAR by extensively evaluating 15 representative methods under 7 different experimental settings on challenging classification and detection benchmarks derived from the dataset. Finally, based on our extensive experiments, we identify valuable insights for SAR ATR and discuss potential future research directions in this field. We hope that the scale, diversity, and benchmarks of ATRNet-STAR can significantly facilitate the advancement of SAR ATR.
Dynamical Causality under Latent Confounders for Biological Network Reconstruction.
Jinling Yan, Shao-Wu Zhang, Chihao Zhang, Weitian Huang, Jifan Shi, Luonan Chen
Pub Date: 2026-01-28, DOI: 10.1109/tpami.2026.3658839
Causal interaction inference is prone to spurious causal interactions due to the many confounders present in a biological system. While many existing methods attempt to address misidentification challenges, there remains a notable lack of effective methods for inferring causal interactions under latent/unobserved confounders. In this work, we develop an orthogonal decomposition theorem in a delay embedding space and, building on it, propose a method that infers dynamical causality under latent confounders and further reconstructs the latent confounders from time-series data. This theoretical foundation ensures causal detection for any high-dimensional system, even with only two observed variables and many latent confounders, which has been a long-standing problem in the field. In addition to the latent confounder problem, such a decomposition makes the coupled variables separable in the embedding space, thus also solving the non-separability problem of causal inference. Extensive validation of the CIC method is carried out on various real datasets, all of which demonstrate its effectiveness in reconstructing real biological networks and unobserved confounders.
Exploring Security Vulnerabilities in Multilingual Speech Translation Systems Via Deceptive Inputs.
Chang Liu, Haolin Wu, Xi Yang, Kui Zhang, Cong Wu, Weiming Zhang, NengHai Yu, Tianwei Zhang, Qing Guo, Jie Zhang
Pub Date: 2026-01-28, DOI: 10.1109/tpami.2026.3658817
As speech translation (ST) systems become increasingly prevalent, understanding their vulnerabilities is crucial for ensuring robust and reliable communication. However, limited work has explored this issue in depth. This paper explores methods of compromising these systems through imperceptible audio manipulations. Specifically, we present two approaches: (1) adapting perturbation-based techniques used for automatic speech recognition (ASR) attacks to the ST context, making our work the first to apply this approach to ST, and (2) proposing a novel music generation-based method to guide targeted translation, along with more practical over-the-air attacks in the physical world. Our experiments reveal that carefully crafted audio perturbations can mislead translation models into producing targeted, harmful outputs, while adversarial music achieves this goal more covertly by exploiting the natural inconspicuousness of music. These attacks have proven effective across multiple languages and translation models, highlighting a systemic vulnerability in current ST architectures. Beyond immediate security concerns, our findings highlight broader challenges in the robustness and interpretability of neural speech systems. More details and samples can be found at https://adv-st.github.io.