Pub Date: 2025-06-11 DOI: 10.1109/TCSVT.2025.3577724
Jie Li;Zhixin Li;Zhi Liu;Peng Yuan Zhou;Richang Hong;Qiyue Li;Han Hu
Volumetric video, also referred to as hologram video, is an emerging medium that represents 3D content in extended reality. As a next-generation video technology, it is poised to become a key application in 5G and future wireless communication networks. Because each user generally views only a specific portion of the volumetric video, known as the viewport, accurate viewport prediction is crucial for ensuring optimal streaming performance. Despite its significance, research in this area is still in its early stages. To this end, this paper introduces a novel approach called Saliency and Trajectory-based Viewport Prediction (STVP), which enhances the accuracy of viewport prediction in volumetric video streaming by effectively leveraging both video saliency and viewport trajectory information. In particular, we first introduce a novel sampling method, Uniform Random Sampling (URS), which efficiently preserves video features while minimizing computational complexity. Next, we propose a saliency detection technique that integrates spatial and temporal information to identify both static and dynamic regions that are geometrically or luminance-salient. Finally, we fuse saliency and trajectory information to achieve more accurate viewport prediction. Extensive experimental results validate the superiority of our method over existing state-of-the-art schemes. To the best of our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. We also make the source code of this work publicly available.
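As a rough, hypothetical illustration of the block-uniform sampling idea suggested by URS (not the paper's implementation; the voxel size and per-voxel budget below are assumed), a point cloud can be partitioned into a uniform voxel grid with a few points randomly kept per occupied cell:

```python
# Hypothetical sketch: voxel-uniform random sampling of a point cloud (N x 3 array).
import numpy as np

def uniform_random_sample(points, voxel_size=0.1, samples_per_voxel=8, seed=0):
    """Partition points into a uniform voxel grid, then randomly keep up to
    `samples_per_voxel` points per occupied voxel."""
    rng = np.random.default_rng(seed)
    voxel_ids = np.floor(points / voxel_size).astype(np.int64)
    # Group point indices by voxel via the inverse mapping of np.unique.
    _, inverse = np.unique(voxel_ids, axis=0, return_inverse=True)
    kept = []
    for v in np.unique(inverse):
        idx = np.flatnonzero(inverse == v)
        if len(idx) > samples_per_voxel:
            idx = rng.choice(idx, size=samples_per_voxel, replace=False)
        kept.append(idx)
    return points[np.concatenate(kept)]

if __name__ == "__main__":
    cloud = np.random.default_rng(1).random((10000, 3))
    sampled = uniform_random_sample(cloud)
    print(cloud.shape, "->", sampled.shape)
```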
{"title":"Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and User Trajectory Information","authors":"Jie Li;Zhixin Li;Zhi Liu;Peng Yuan Zhou;Richang Hong;Qiyue Li;Han Hu","doi":"10.1109/TCSVT.2025.3577724","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3577724","url":null,"abstract":"Volumetric video, also referred to as hologram video, is an emerging medium that represents 3D content in extended reality. As a next-generation video technology, it is poised to become a key application in 5G and future wireless communication networks. Because each user generally views only a specific portion of the volumetric video, known as the viewport, accurate prediction of the viewport is crucial for ensuring an optimal streaming performance. Despite its significance, research in this area is still in the early stages. To this end, this paper introduces a novel approach called Saliency and Trajectory-based Viewport Prediction (STVP), which enhances the accuracy of viewport prediction in volumetric video streaming by effectively leveraging both video saliency and viewport trajectory information. In particular, we first introduce a novel sampling method, Uniform Random Sampling (URS), which efficiently preserves video features while minimizing computational complexity. Next, we propose a saliency detection technique that integrates both spatial and temporal information to identify visually static and dynamic geometric and luminance-salient regions. Finally, we fuse saliency and trajectory information to achieve more accurate viewport prediction. Extensive experimental results validate the superiority of our method over existing state-of-the-art schemes. To the best of our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. We also make the source code of this work publicly available.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12816-12829"},"PeriodicalIF":11.1,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-11 DOI: 10.1109/TCSVT.2025.3578726
Lixin Zhang;Qian Wang
Accurate segmentation of diverse structures in pathological images is crucial for medical analysis. While widely used RGB images offer high spatial resolution, microscopic hyperspectral images (MHSIs) provide unique biomedical spectral signatures. Existing multi-modal segmentation methods, however, often suffer from insufficient uni-modal learning, ineffective cross-modal interaction, and non-adaptive multi-modal fusion. Therefore, we propose a novel synergistic multi-modal learning paradigm for co-registered RGB-MHSIs, instantiated within the Synergistic Fusion Network (SyFusNet), which comprises: modality-specific modules and objectives to ensure uni-modal feature extraction, the Mutual Knowledge Sharing Module (MKSM) for explicit cross-modal interaction, and the Adaptive Dual-level Co-decision Module (ADCM) for collaborative multi-modal segmentation. Alongside uni-modal learning, MKSM disentangles MHSI- and RGB-specific features into band- and position-aware guidance, respectively, sharing them as cross-modal knowledge to enhance each other's representations. To fuse multi-modal predictions, ADCM generates global attention from integrated multi-modal features to adaptively refine decision-level outputs, yielding reliable segmentation. Experiments demonstrate that SyFusNet outperforms state-of-the-art methods with statistical significance ($p < 0.01$), achieving relative IoU gains of 9.35%, 4.63%, and 2.47% on the public PLGC, MDC, and WBC datasets, respectively, while also exhibiting strong generalizability and diagnostic potential through practical applications in multi-class segmentation and tumor regression grading.
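For readers who want to reproduce this kind of statistical claim, a minimal sketch of a paired significance test over per-image IoU scores is shown below; it uses a Wilcoxon signed-rank test on synthetic masks and does not assume the paper's exact test or data:

```python
# Illustrative sketch (not the paper's code): paired significance test on
# per-image IoU scores from two segmentation models.
import numpy as np
from scipy.stats import wilcoxon

def iou(pred, gt):
    """Binary IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

rng = np.random.default_rng(0)
gt = rng.random((30, 64, 64)) > 0.5                        # 30 synthetic ground-truth masks
model_a = np.logical_xor(gt, rng.random(gt.shape) > 0.9)   # noisier predictions (~10% flipped)
model_b = np.logical_xor(gt, rng.random(gt.shape) > 0.95)  # cleaner predictions (~5% flipped)

scores_a = np.array([iou(p, g) for p, g in zip(model_a, gt)])
scores_b = np.array([iou(p, g) for p, g in zip(model_b, gt)])
stat, p = wilcoxon(scores_b, scores_a)                     # paired, non-parametric test
print(f"mean IoU A={scores_a.mean():.3f}, B={scores_b.mean():.3f}, p={p:.4f}")
```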
{"title":"Synergistic Fusion Network of Microscopic Hyperspectral and RGB Images for Multi-Perspective Segmentation","authors":"Lixin Zhang;Qian Wang","doi":"10.1109/TCSVT.2025.3578726","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3578726","url":null,"abstract":"Accurate segmentation of diverse structures in pathological images is crucial for medical analysis. While widely used RGB images offer high spatial resolution, microscopic hyperspectral images (MHSIs) provide unique biomedical spectral signatures. Existing multi-modal segmentation methods, however, often suffer from insufficient uni-modal learning, ineffective cross-modal interaction, and nonadaptive multi-modal fusion. Therefore, we propose a novel synergistic multi-modal learning paradigm for co-registered RGB-MHSIs, instantiated within the <underline>Sy</u>nergistic <underline>Fus</u>ion <underline>Net</u>work (SyFusNet) which comprises: modality-specific modules and objectives to ensure uni-modal feature extraction, the Mutual Knowledge Sharing Module (MKSM) for explicit cross-modal interaction, and the Adaptive Dual-level Co-decision Module (ADCM) for collaborative multi-modal segmentation. Alongside uni-modal learning, MKSM disentangles MHSI- and RGB-specific features into band- and position-aware guidance, respectively, sharing as cross-modal knowledge to enhance each other’s representations. To fuse multi-modal predictions, ADCM generates global attention from integrated multi-modal features to adaptively refine decision-level outputs, yielding reliable segmentation. Experiments demonstrate that SyFusNet outperforms state-of-the-art methods with statistical significance <inline-formula> <tex-math>$boldsymbol {(p lt 0.01)}$ </tex-math></inline-formula>, achieving relative IoU gains of 9.35%, 4.63%, and 2.47% on the public PLGC, MDC, and WBC datasets, respectively, while also exhibiting strong generalizability and diagnostic potential through practical applications in multi-class segmentation and tumor regression grading.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12904-12917"},"PeriodicalIF":11.1,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-11 DOI: 10.1109/TCSVT.2025.3578670
Guanqi Ding;Xinzhe Han;Shuhui Wang;Xin Jin;Qingming Huang
Few-shot image generation aims to generate data of an unseen category based on only a few samples. Apart from basic content generation, a range of downstream applications stand to benefit from this task, such as low-data detection and few-shot classification. To achieve this goal, the generated images should guarantee category retention for classification beyond visual quality and diversity. In our preliminary work, we present an “editing-based” framework, Attribute Group Editing (AGE), for reliable few-shot image generation, which largely improves performance compared with existing methods that require re-training a GAN with limited data. Nevertheless, AGE’s performance on downstream classification is not as satisfactory as expected. Furthermore, existing generative models suffer from similar issues. This paper focuses on addressing the issue of universal class inconsistency in all generative models. It not only improves AGE to enhance its ability to preserve class information but also conducts a comprehensive analysis of the causes of this problem in generative models from multiple perspectives, proposing potential directions for resolution. We first propose Stable Attribute Group Editing (SAGE) for more stable class-relevant image generation. SAGE corrects the inaccurate assumptions in AGE and leverages the distribution information from seen categories to accurately estimate the data distribution of unseen categories, thereby eliminating the class inconsistency issue in the generated data. We apply SAGE to both GANs and diffusion models to verify its flexibility and further achieve promising generation performance. Going one step further, we find that even though the generated images look photo-realistic and require no category-relevant editing, they are usually of limited help for downstream classification. We systematically discuss this issue from both the generation and classification perspectives, and propose to boost the downstream classification performance of SAGE by enhancing the pixel and frequency components. Extensive experiments provide valuable insights into extending image generation to wider downstream applications. Code is available at https://github.com/UniBester/SAGE
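As a hedged sketch of the general "editing-based" idea that AGE and SAGE build on, the toy example below moves a latent code along attribute directions estimated from seen-category statistics; PCA over class-centered latents is used as a stand-in for the papers' actual direction-estimation procedure, and all dimensions and scales are illustrative:

```python
# Hedged sketch of editing-based few-shot generation: edit a latent code along
# directions estimated from seen-category latent statistics (illustrative only).
import numpy as np

def estimate_attribute_directions(latents_by_class, n_dirs=4):
    """PCA over class-centered latents of seen categories gives candidate
    attribute directions (an assumption, not the papers' exact recipe)."""
    centered = np.concatenate(
        [z - z.mean(axis=0, keepdims=True) for z in latents_by_class], axis=0)
    # Rows of vt are the top principal directions of the centered latents.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_dirs]                                  # (n_dirs, latent_dim)

def edit(latent, directions, alphas):
    """Apply attribute-group edits: z' = z + sum_i alpha_i * d_i."""
    return latent + alphas @ directions

rng = np.random.default_rng(0)
seen = [rng.normal(loc=i, size=(200, 128)) for i in range(5)]   # 5 seen classes of latents
dirs = estimate_attribute_directions(seen)
z_unseen = rng.normal(size=(128,))                              # one unseen-class latent
edited = edit(z_unseen, dirs, alphas=np.array([1.5, -0.5, 0.0, 0.8]))
print(edited.shape)
```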
{"title":"Stable Attribute Group Editing for Reliable Few-Shot Image Generation","authors":"Guanqi Ding;Xinzhe Han;Shuhui Wang;Xin Jin;Qingming Huang","doi":"10.1109/TCSVT.2025.3578670","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3578670","url":null,"abstract":"Few-shot image generation aims to generate data of an unseen category based on only a few samples. Apart from basic content generation, a bunch of downstream applications hopefully benefit from this task, such as low-data detection and few-shot classification. To achieve this goal, the generated images should guarantee category retention for classification beyond the visual quality and diversity. In our preliminary work, we present an “editing-based” framework, Attribute Group Editing (AGE), for reliable few-shot image generation, which largely improves the performance compared with existing methods that require re-training a GAN with limited data. Nevertheless, AGE’s performance on downstream classification is not as satisfactory as expected. Furthermore, existing generative models suffer from similar issues. This paper focuses on addressing the issue of universal class inconsistency in all generative models. It not only improves AGE to enhance its ability to preserve class information but also conducts a comprehensive analysis of the causes of this problem in generative models from multiple perspectives, proposing potential directions for resolution. We first propose Stable Attribute Group Editing (SAGE) for more stable class-relevant image generation. SAGE corrects the inaccurate assumptions in AGE and leverages the distribution information from seen categories to accurately estimate the data distribution of unseen categories, thereby eliminating the class inconsistency issue in the generated data. We apply SAGE to both GANs and diffusion models to verify its flexibility and further achieve promising generation performance. Going one step further, we find that even though the generated images look photo-realistic and require no category-relevant editing, they are usually of limited help for downstream classification. We systematically discuss this issue from both the generation and classification perspectives, and propose to boost the downstream classification performance of SAGE by enhancing the pixel and frequency components. Extensive experiments provide valuable insights into extending image generation to wider downstream applications. Codes are available at <uri>https://github.com/UniBester/SAGE</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12719-12733"},"PeriodicalIF":11.1,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-09 DOI: 10.1109/TCSVT.2025.3578153
Gang He;Long Gao;Langkun Chen;Yan Jiang;Weiying Xie;Yunsong Li
Hyperspectral videos contain a large number of spectral bands, providing extensive spectral information and material identification capabilities. This advantage enables hyperspectral trackers to achieve superior performance in challenging tracking scenarios. However, the limited availability of hyperspectral training data and the inability of existing algorithms to fully exploit hyperspectral information restrict tracking performance. To address this issue, a novel framework, Spectral Prompt-based Hyperspectral Object Tracking (SP-HST), is proposed. SP-HST leverages an RGB tracking network as the main branch for feature extraction and tracking, which accounts for more than 98% of the total parameters and remains frozen during the training procedure. Additionally, the Spectral Prompt Learning (SPL) branch, comprising multiple lightweight prompt blocks, is introduced to generate complementary spectral representations as the prompt. The prompts contain abundant spectral information from the hyperspectral data, enhancing the discriminative ability of features within the main branch. Furthermore, Complementary Weight Learning (CWL) is employed to calculate the importance of spectral information from different prompts, enabling the features for hyperspectral object tracking to contain spectral information that is absent from the features of the main branch. By utilizing spectral information as the prompt, the number of trainable parameters is less than 2% of that in the tracking network, and convergence is reached within 12 training epochs. Extensive experiments demonstrate the superiority of SP-HST, which achieves new state-of-the-art tracking performance: an AUC score of 71.3% on the HOTC dataset and a DP@20P score of 96.7% on the IMEC25 dataset. The code will be released at https://github.com/lgao001/SP-HST
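A minimal PyTorch sketch of the prompt-learning recipe described here, with a frozen RGB backbone and a small trainable spectral-prompt branch, is given below; the module shapes, the additive injection point, and the 16-band input are assumptions rather than the released SP-HST code:

```python
# Hedged sketch: freeze an RGB backbone and train only a lightweight spectral
# prompt branch (stand-in modules, not the SP-HST implementation).
import torch
import torch.nn as nn

class PromptBlock(nn.Module):
    """Lightweight block mapping hyperspectral bands to a feature-sized prompt."""
    def __init__(self, bands, feat_dim, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bands, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, feat_dim, kernel_size=1),
        )

    def forward(self, hsi):
        return self.net(hsi)

backbone = nn.Sequential(                      # stand-in for a frozen RGB tracker backbone
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
for p in backbone.parameters():
    p.requires_grad_(False)                    # main branch stays frozen

prompt = PromptBlock(bands=16, feat_dim=64)    # trainable spectral-prompt branch

rgb = torch.randn(1, 3, 128, 128)
hsi = torch.randn(1, 16, 128, 128)
feat = backbone(rgb) + prompt(hsi)             # additive prompt injection (an assumption)

trainable = sum(p.numel() for p in prompt.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```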
{"title":"Hyperspectral Object Tracking With Spectral Information Prompt","authors":"Gang He;Long Gao;Langkun Chen;Yan Jiang;Weiying Xie;Yunsong Li","doi":"10.1109/TCSVT.2025.3578153","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3578153","url":null,"abstract":"Hyperspectral videos contain a larger number of spectral bands, providing extensive spectral information and material identification capabilities. This advantage confers hyperspectral trackers to achieve superior performance in challenging tracking scenarios. However, the limited availability of hyperspectral training data and the inability of existing algorithms to fully exploit hyperspectral information restrict the tracking performance. To address this issue, a novel framework, Spectral Prompt-based Hyperspectral Object Tracking (SP-HST), is proposed. SP-HST leverages a RGB tracking network as the main branch for feature extraction and tracking, which accounts for more than 98% of the total parameters and remains frozen during the training procedure. Additionally, the Spectral Prompt Learning (SPL) branch, comprising multiple lightweight prompt blocks, is introduced to generate complementary spectral representations as the prompt. The prompts contain abundant spectral information from hyperspectral data, enhancing the discriminative ability of features within the main branch. Furthermore, the Complementary Weight Learning (CWL) is employed to calculate the importance of spectral information from different prompts, enabling the features for hyperspectral object tracking to contain more spectral information that is absent in the feature of the main branch. By utilizing the spectral information as prompt, the number of trainable parameters is less than 2% of that in the tracking network, and the convergence is reached in 12 training epoch. Extensive experiments demonstrate the superiority of SP-HST, achieving a new state-of-the-art tracking performance, 71.3% of the AUC score on the HOTC dataset and 96.7% of the DP@20P score on the IMEC25 dataset. The code will be released at <uri>https://github.com/lgao001/SP-HST</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12636-12651"},"PeriodicalIF":11.1,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-04 DOI: 10.1109/TCSVT.2025.3576344
Shuai Yuan;Guangyong Gao;Yimin Yu;Zhihua Xia
With the popularization of digital information, reversible data hiding in ciphertext has become a critical research focus for privacy protection in cloud storage. A reversible data hiding method for encrypted images is proposed: Reversible Data Hiding in Encrypted Images with Adaptive Multi-directional MED and Huffman Code based on Interval-Wise Dynamic Prediction Axes (RDHEI-AHIDA). First, the original image is predicted by the gradient Adaptive Multi-Directional Median Edge Detector (AM-MED) to obtain the critical gradient and the position of the Interval-wise Dynamic Prediction Axes (IDP-Axes). Then, information bits are allocated at intervals on the IDP-Axes. Combining the determined position of the IDP-Axes with the critical gradient, the prediction error values of the original image are calculated and recorded. After the image is encrypted, an adaptive Huffman code rule is established according to the distribution of prediction error values, and pixel marking, classification, and auxiliary information embedding are carried out. Finally, the secret data is embedded by the bit replacement method. Compared with state-of-the-art RDHEI methods, experimental results show that RDHEI-AHIDA not only provides a higher pure payload while ensuring security but also exhibits a degree of robustness.
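For context, the classic MED predictor that AM-MED extends predicts each pixel from its left, top, and top-left neighbours; a minimal sketch follows (the adaptive, multi-directional, and interval-wise logic of RDHEI-AHIDA is not reproduced here):

```python
# Sketch of the classic MED (median edge detector) predictor and its
# prediction errors; small errors compress well, e.g. with Huffman codes.
import numpy as np

def med_predict(img):
    """Predict each pixel from its left (a), top (b), and top-left (c)
    neighbours; the first row and column are left unpredicted."""
    img = img.astype(np.int32)
    pred = img.copy()
    h, w = img.shape
    for i in range(1, h):
        for j in range(1, w):
            a, b, c = img[i, j - 1], img[i - 1, j], img[i - 1, j - 1]
            if c >= max(a, b):
                pred[i, j] = min(a, b)
            elif c <= min(a, b):
                pred[i, j] = max(a, b)
            else:
                pred[i, j] = a + b - c
    return img - pred          # prediction errors

errors = med_predict(np.random.default_rng(0).integers(0, 256, (64, 64)))
print(errors.min(), errors.max())
```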
{"title":"Reversible Data Hiding in Encrypted Images With Adaptive Multi-Directional MED and Huffman Code Based on Interval-Wise Dynamic Prediction Axes","authors":"Shuai Yuan;Guangyong Gao;Yimin Yu;Zhihua Xia","doi":"10.1109/TCSVT.2025.3576344","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3576344","url":null,"abstract":"With the popularization of digital information, reversible data hiding in ciphertext has become a critical research focus in privacy protection in cloud storage. A reversible data hiding method for encrypted images is proposed: Reversible Data Hiding in Encrypted Images with Adaptive Multi-directional MED and Huffman Code based on Interval-Wise Dynamic Prediction Axes (RDHEI-AHIDA). Firstly, the original image is predicted by the gradient Adaptive Multi-Directional Median Edge Detector (AM-MED) to obtain the critical gradient and the position of the Interval-wise Dynamic Prediction Axes (IDP-Axes). Then, information bits are allocated at intervals on the IDP-Axes. Combining the determined position of the IDP-Axes and the critical gradient, the prediction error values of the original image are calculated and recorded. After the image is encrypted, according to the distribution of prediction error values, an adaptive Huffman code rule is established, and pixel marking, classification and auxiliary information embedding are carried out. Finally, the secret data is embedded by the bit replacement method. Compared with the state-of-the-art RDHEI methods, experimental results show that RDHEI-AHIDA not only provides a higher pure payload while ensuring security but also exhibits certain robustness.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 11","pages":"11708-11722"},"PeriodicalIF":11.1,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-04 DOI: 10.1109/TCSVT.2025.3576619
Ming Jin;Lei Zhu;Richang Hong
As an information carrier, video provides people with a vast amount of important information, so effective methods for obtaining video are particularly important; this drives research on text-video cross-modal retrieval technology. However, current text-video cross-modal retrieval models still face several issues. First, these models do not fully utilize the powerful reasoning and generative capabilities of large models to address the issues of missing critical objects and insufficient high-quality video-text paired training data. Second, existing retrieval models do not adequately investigate bidirectional cross-modal semantic interaction and reasoning mechanisms, which hinders their ability to fully capture and learn the implicit semantic features between different modalities. To address these issues, we propose an innovative cross-modal retrieval model based on bidirectional semantic reasoning and large-model data augmentation (BiSeR-LMA). This model first leverages the strong reasoning and generative capabilities of large models to perform semantic reasoning on the textual descriptions of videos and then generates multiple semantically rich video frames, thereby compensating for missing critical objects in the original video and improving the quality of video-text paired training data. Second, we design a bidirectional text-video semantic reasoning module, which uses features from one modality as auxiliary information to help the model reason about the implicit semantic information of the other modality. This enhances the model's capability to establish semantic relationships and perform reasoning on implicit semantics, promoting text-video semantic alignment. Finally, we verify the effectiveness of the proposed cross-modal retrieval model on the MSR-VTT, LSMDC, and MSVD datasets.
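As background, the standard contrastive text-video alignment that retrieval models of this kind typically build on can be sketched as a symmetric cross-entropy over a batch similarity matrix; this is a generic illustration, not BiSeR-LMA's specific modules:

```python
# Generic text-video contrastive alignment sketch (a common backbone of
# cross-modal retrieval systems).
import torch
import torch.nn.functional as F

def symmetric_infonce(text_emb, video_emb, temperature=0.07):
    """Symmetric cross-entropy over the cosine-similarity matrix of a batch
    of paired text/video embeddings."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = t @ v.t() / temperature                   # (B, B) similarity matrix
    targets = torch.arange(len(t), device=t.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = symmetric_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```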
{"title":"BiSeR-LMA: A Bidirectional Semantic Reasoning and Large Model Enhancement Approach for Text-Video Cross-Modal Retrieval","authors":"Ming Jin;Lei Zhu;Richang Hong","doi":"10.1109/TCSVT.2025.3576619","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3576619","url":null,"abstract":"Video, as an information carrier, provides a vast amount of important information to people. Therefore, the method of obtaining video becomes particularly important, which drives the research on text-video cross-modal retrieval technology. However, current text-video cross-modal retrieval models still face several issues. First, these models do not fully utilize the powerful reasoning and generative capabilities of large models to address the issues of missing critical objects and insufficient high-quality video-text paired training data. Second, existing retrieval models do not adequately research the bidirectional cross-modal semantic interaction and reasoning mechanism, which hinders the ability to fully capture and learn the implicit semantic features between different modalities. To address these issues, we propose an innovative bidirectional semantic reasoning and large model data augmentation cross-modal retrieval model (BiSeR-LMA). This model first leverages the strong reasoning and generative capabilities of large models to perform semantic reasoning on the textual descriptions of videos, then generates multiple semantically rich video frames, thereby compensating for the missing critical objects in the original video and improving the quality of video-text paired training data. Second, we design a bidirectional text-video semantic reasoning module, which uses features from one modality as auxiliary information to assist the model in reasoning the implicit semantic information of another modality. This enhances the model’s capability to establish semantic relationships and perform reasoning on implicit semantics, promoting text-video semantic alignment. Finally, we verify the effectiveness of the proposed cross-modal retrieval model on the MSR-VTT, LSMDC, and MSVD datasets.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 11","pages":"11655-11666"},"PeriodicalIF":11.1,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-04 DOI: 10.1109/TCSVT.2025.3576354
Dingcheng Gao;Yanjun Qin;Xiaoming Tao;Jianhua Lu
The likelihood of encountering scenarios that lead to accidents, namely safety-critical scenarios, is minimal compared to long-term safe driving environments. The generation of repeatable and scalable safety-critical scenarios is essential for advancing human and autonomous driving capabilities. In contrast to the high complexity and low practicality of existing scenario generation methods, this paper proposes a real-time approach that automatically generates challenging scenarios and instantiates them in a CARLA-based simulator. First, the safety-critical scenario is decomposed into a perturbed and optimized vehicle trajectory and the remaining reusable Unreal Engine assets based on a hierarchical model. Second, a model based on a graph conditional variational autoencoder (VAE) is employed to predict future trajectories and heading angles from past information. Third, the safety-critical scene generation model enhances scene diversity by diversifying the latent variables over a pre-trained trajectory representation model. Finally, the trajectories of real-world vehicles are adapted and placed into the simulator to enable the generation of safety-critical scenes in a three-dimensional environment. The results demonstrate that the proposed approach generates scenarios that are more plausible than those generated by the baselines, with a performance improvement of over 10% in collision metrics for scenario generation. This research simplifies the construction of long-tail scenarios for autonomous vehicles, which in turn supports the optimization of algorithms such as autonomous trajectory planning.
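A hedged sketch of the "diversify the latent, decode many futures" idea behind conditional-VAE trajectory generation is given below; the decoder architecture, feature sizes, and prediction horizon are illustrative assumptions rather than the paper's model:

```python
# Hedged sketch: sample several latent codes for one encoded past trajectory
# and decode a diverse set of candidate future trajectories.
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, past_dim=32, latent_dim=16, horizon=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(past_dim + latent_dim, 64), nn.ReLU(),
            nn.Linear(64, horizon * 2),            # (x, y) per future step
        )
        self.horizon = horizon

    def forward(self, past_feat, z):
        out = self.net(torch.cat([past_feat, z], dim=-1))
        return out.view(-1, self.horizon, 2)

decoder = TrajectoryDecoder()
past_feat = torch.randn(1, 32)                      # encoded past trajectory
k = 5                                               # number of diverse samples
z = torch.randn(k, 16)                              # perturbed/diversified latent codes
futures = decoder(past_feat.expand(k, -1), z)
print(futures.shape)                                # (5, 12, 2): five candidate futures
```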
{"title":"Diversifying Latent Flows for Safety-Critical Scenarios Generation With CARLA Simulator","authors":"Dingcheng Gao;Yanjun Qin;Xiaoming Tao;Jianhua Lu","doi":"10.1109/TCSVT.2025.3576354","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3576354","url":null,"abstract":"The likelihood of encountering scenarios that lead to accidents, namely safety-critical scenarios, is minimal compared to long-term safe driving environments. The generation of repeatable and scalable safety-critical scenarios is essential for the advancement of human and autonomous driving capabilities. Compared with the high complexity and low practicality of existing scenario generation methods, in this paper we propose a real-time approach to automatically generate challenging scenarios and instantiate them in a CARLA-based simulator. First, the safety-critical scenario is decomposed into a perturbed and optimized vehicle trajectory and the remaining reusable Unreal Engine assets based on a hierarchical model. Second, a model that is based on a graph conditional variational autoencoder (VAE) is employed to predict future trajectories and head angles based on past information. Third, the safety-critical scene generation model is used to enhance the diversity of the scene by diversifying the latent variables over a pre-trained trajectory representation model. Finally, the trajectories of real-world vehicles are placed into the simulator by adapting them to enable the generation of safety-critical scenes in a three-dimensional environment. The results demonstrate that the proposed approach generates scenarios that are more plausible than those generated by the baselines, with a performance improvement of over 10% in collision metrics for scenario generation. The research facilitates the simplification of the long-tail scenario construction process for autonomous vehicles, which in turn facilitates the optimization of algorithms such as autonomous trajectory planning.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 11","pages":"11723-11736"},"PeriodicalIF":11.1,"publicationDate":"2025-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-02 DOI: 10.1109/TCSVT.2025.3575082
Dazhi Xu;Ming Li;Yan Wu;Peng Zhang;Xinyue Xin
Polarimetric synthetic aperture radar (PolSAR) image change detection (CD) aims to accurately analyze differences and detect changes in PolSAR images. Recently, the graph transformer (GT), which combines the advantages of graph convolutional networks and transformers, has attracted increasing attention in the field of remote sensing. However, the direct application of GT to PolSAR image CD with limited training samples is challenging owing to polarimetric scattering confusion and random speckle noise. Here, we propose a novel unsupervised representation learning framework for CD in PolSAR images, named statistic-guided difference enhancement GT (SDEGT). Our motivation is that polarimetric statistics can effectively guide GT to extract robust and highly discriminative features from the raw polarimetric graphs and thus accurately detect changes. SDEGT follows a neighborhood-aggregation GT architecture and introduces polarimetric statistics to guide feature difference enhancement, thereby capturing the structural interaction between graph nodes and aggregating local-to-global change correlations at low computational cost. First, SDEGT introduces noise-robust polarimetric statistics to improve its noise suppression ability and learn sufficient change-aware features from the PolSAR data. Subsequently, guided by the polarimetric statistical difference, a difference enhancement module (DEM) is designed and embedded in SDEGT to adaptively enhance the difference between changed and unchanged nodes, thus improving the discrimination of the change-aware features. Finally, symmetric cross-entropy (SCE) is employed to facilitate robust learning of SDEGT and attenuate the detrimental effect of label noise. Visual and quantitative experimental results on five measured PolSAR datasets with different scenes and dimensions demonstrate the competitiveness of our SDEGT over other state-of-the-art methods.
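The symmetric cross-entropy (SCE) loss mentioned here for label-noise robustness combines standard cross-entropy with a reverse term in which log(0) is clamped to a finite constant; a minimal sketch follows, with alpha/beta weights that are illustrative defaults rather than the paper's settings:

```python
# Sketch of symmetric cross-entropy (SCE) for label-noise-robust training.
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, labels, alpha=0.1, beta=1.0, log_zero=-4.0):
    """SCE = alpha * CE(labels, pred) + beta * reverse CE(pred, labels)."""
    ce = F.cross_entropy(logits, labels)
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(labels, logits.size(1)).float()
    # Reverse CE treats the (possibly noisy) labels as the distribution being
    # "predicted"; log(0) entries of the one-hot target are clamped to log_zero.
    log_q = torch.where(one_hot > 0, torch.zeros_like(one_hot),
                        torch.full_like(one_hot, log_zero))
    rce = -(pred * log_q).sum(dim=1).mean()
    return alpha * ce + beta * rce

loss = symmetric_cross_entropy(torch.randn(8, 3), torch.randint(0, 3, (8,)))
print(loss.item())
```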
{"title":"Statistic-Guided Difference Enhancement Graph Transformer for Unsupervised Change Detection in PolSAR Images","authors":"Dazhi Xu;Ming Li;Yan Wu;Peng Zhang;Xinyue Xin","doi":"10.1109/TCSVT.2025.3575082","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3575082","url":null,"abstract":"Polarimetric synthetic aperture radar (PolSAR) image change detection (CD) aims to accurately analyze the difference and detect changes in PolSAR images. Recently, graph transformer (GT), which combines the advantages of graph convolutional network and transformer, has increasingly attracted attention in the field of remote sensing. However, the direct application of GT for PolSAR image CD with limited training samples is challenging owing to polarimetric scattering confusion and random speckle noise. Here, we propose a novel unsupervised representation learning framework for CD in PolSAR images, named statistic-guided difference enhancement GT (SDEGT). Our motivation is that polarimetric statistics can effectively guide GT to extract robust and highly discriminative features from the raw polarimetric graphs and thus accurately detect changes. The SDEGT follows the architecture based on neighborhood aggregation GT and innovatively introduces polarimetric statistics to guide feature difference enhancement, thereby capturing the structural interaction between graph nodes and aggregating the local-to-global change correlations at low computational cost. First, SDEGT innovatively introduces noise-robust polarimetric statistics to improve its noise suppression ability and learn sufficient change-aware features from the PolSAR data. Subsequently, guided by the polarimetric statistical difference, a difference enhancement module (DEM) is designed and cleverly embedded in the SDEGT to adaptively enhance the difference between changed and unchanged nodes, thus improving the discrimination of the change-aware features. Finally, symmetric cross-entropy (SCE) is employed to facilitate the robust learning of SDEGT and attenuate the detrimental effect of label noise. Visual and quantitative experimental results on five measured PolSAR datasets with different scenes and dimensions demonstrate the competitiveness of our SDEGT over other state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 11","pages":"11667-11684"},"PeriodicalIF":11.1,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}