Pub Date : 2025-12-05  DOI: 10.1109/TCSVT.2025.3634931
IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 12, pp. C3-C3.
Pub Date : 2025-10-31  DOI: 10.1109/TCSVT.2025.3623686
IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 11, pp. C3-C3.
Pub Date : 2025-10-03  DOI: 10.1109/TCSVT.2025.3612531
IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 10, pp. C3-C3.
Pub Date : 2025-09-09  DOI: 10.1109/TCSVT.2025.3600974
IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. C3-C3.
Pub Date : 2025-09-09  DOI: 10.1109/TCSVT.2025.3600972
IEEE Transactions on Circuits and Systems for Video Technology Publication Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. C2-C2.
Pub Date : 2025-08-05  DOI: 10.1109/TCSVT.2025.3592055
IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 8, pp. C3-C3.
Pub Date : 2025-07-22  DOI: 10.1109/TCSVT.2025.3588882
Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation
Yang Li; Songlin Yang; Wei Wang; Jing Dong
Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn considerable attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, T2I models fail to accurately generate subject-preserving images. The main problem is that pre-trained T2I models do not learn the mapping between target subjects and their corresponding visual content. Even when multiple target subject images are provided, previous personalization methods either fail to accurately fit the subject region or lose the interactive generative ability with other existing concepts in the T2I model space. For example, they are unable to generate T2I-aligned, semantic-fidelity images for prompts involving other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting an accurate and interactive subject embedding into the Stable Diffusion model for semantic-fidelity personalized generation from a single image. We address this challenge from two perspectives: a subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. We then optimize one subject representation as multiple per-stage tokens, where each token contains two disentangled features. This expansion of the textual conditioning space enhances semantic control and thereby improves semantic fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our method on various non-face subjects.
{"title":"Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation","authors":"Yang Li;Songlin Yang;Wei Wang;Jing Dong","doi":"10.1109/TCSVT.2025.3588882","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588882","url":null,"abstract":"Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12607-12621"},"PeriodicalIF":11.1,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-15  DOI: 10.1109/TCSVT.2025.3588516
DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization
Shiwei Wang; Liquan Shen; Peiying Wu; Zhaoyi Tian; Feifeng Wang
Efficient compression of multi-view video data is a critical challenge for many applications due to the large volume of data involved. Although multi-view video coding (MVC) introduces inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at lower resolution through asymmetric rescaling, yielding higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect the inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy facilitates effective interaction with inter-view features, utilizing inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state-of-the-art coding performance.
{"title":"DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization","authors":"Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang","doi":"10.1109/TCSVT.2025.3588516","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588516","url":null,"abstract":"Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state of-the-art coding performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12788-12801"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-15  DOI: 10.1109/TCSVT.2025.3588406
UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval
Yating Liu; Yaowei Li; Xiangyuan Lan; Wenming Yang; Zimo Liu; Qingmin Liao
Text-based Person Retrieval (TPR), a multi-modal task that aims to retrieve a target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive vision-language pre-trained models. Prior works leverage pre-trained CLIP to extract visual and textual person features and fully fine-tune the entire network, which has shown notable performance improvements over uni-modal pre-trained models. However, fully fine-tuning a large model is prone to overfitting and hinders generalization. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components: Prefix, LoRA, and Adapter. Prefix and LoRA are devised together to mine local information with task-specific prompts, while Adapter adjusts global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. First, S-Prefix is proposed to boost the attention on prefix tokens and enhance their gradient propagation, improving the flexibility and performance of the vanilla prefix. Second, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which resolves conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES, and RSTPReid, while fine-tuning only 4.7% of the parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
{"title":"UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval","authors":"Yating Liu;Yaowei Li;Xiangyuan Lan;Wenming Yang;Zimo Liu;Qingmin Liao","doi":"10.1109/TCSVT.2025.3588406","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588406","url":null,"abstract":"Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel <italic>U</i>nified <italic>P</i>arameter-Efficient Transfer Learning (PETL) method for Text-based <italic>Person</i> Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% parameters. Code is available at <uri>https://github.com/Liu-Yating/UP-Person</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12874-12889"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-07-14  DOI: 10.1109/TCSVT.2025.3588710
BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization
Jinglin Xu; Yaqi Zhang; Wenhao Zhou; Hongmin Liu
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods process past and future information simultaneously, neglecting the inherently sequential nature of action occurrence. This conflated treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a Mamba-based bidirectional feature-splitting approach for the TAL task. BFSTAL is composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), and explicitly enhances the model’s capacity to understand action procedures, especially to localize the temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). By combining state-space modeling, attention mechanisms, and convolutional networks, DBH effectively captures long-range dependencies while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
{"title":"BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization","authors":"Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu","doi":"10.1109/TCSVT.2025.3588710","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588710","url":null,"abstract":"Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12707-12718"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}