
Latest publications in IEEE Transactions on Circuits and Systems for Video Technology

IEEE Circuits and Systems Society Information
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-12-05 | DOI: 10.1109/TCSVT.2025.3634931
Vol. 35, No. 12, pp. C3-C3 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11278896
Citations: 0
IEEE Circuits and Systems Society Information
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-10-31 | DOI: 10.1109/TCSVT.2025.3623686
Vol. 35, No. 11, pp. C3-C3 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11223417
Citations: 0
IEEE Circuits and Systems Society Information
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-10-03 | DOI: 10.1109/TCSVT.2025.3612531
Vol. 35, No. 10, pp. C3-C3 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11192813
Citations: 0
IEEE Circuits and Systems Society Information
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-09-09 | DOI: 10.1109/TCSVT.2025.3600974
Vol. 35, No. 9, pp. C3-C3 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11154653
Citations: 0
IEEE Transactions on Circuits and Systems for Video Technology Publication Information
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-09-09 | DOI: 10.1109/TCSVT.2025.3600972
Vol. 35, No. 9, pp. C2-C2 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11154656
Citations: 0
IEEE Circuits and Systems Society Information
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-08-05 | DOI: 10.1109/TCSVT.2025.3592055
Vol. 35, No. 8, pp. C3-C3 | Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11114434
Citations: 0
Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-22 | DOI: 10.1109/TCSVT.2025.3588882
Yang Li;Songlin Yang;Wei Wang;Jing Dong
Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.
{"title":"Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation","authors":"Yang Li;Songlin Yang;Wei Wang;Jing Dong","doi":"10.1109/TCSVT.2025.3588882","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588882","url":null,"abstract":"Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects such as themselves or their pet cat, the T2I models fail to accurately generate their subject-preserved images. The main problem is that pre-trained T2I models do not learn the T2I mapping between the target subjects and their corresponding visual contents. Even if multiple target subject images are provided, previous personalization methods either failed to accurately fit the subject region or lost the interactive generative ability with other existing concepts in T2I model space. For example, they are unable to generate T2I-aligned and semantic-fidelity images for the given prompts with other concepts such as scenes (“Eiffel Tower”), actions (“holding a basketball”), and facial attributes (“eyes closed”). In this paper, we focus on inserting accurate and interactive subject embedding into the Stable Diffusion Model for semantic-fidelity personalized generation using one image. We address this challenge from two perspectives: subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss to guide the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. Then, we optimize one subject representation as multiple per-stage tokens, and each token contains two disentangled features. This expansion of the textual conditioning space enhances the semantic control, thereby improving semantic-fidelity. We conduct extensive experiments on the most challenging subjects, face identities, to validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our methods on various non-face subjects.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12607-12621"},"PeriodicalIF":11.1,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
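The subject-wise attention loss described in the abstract above can be pictured as a penalty on cross-attention mass that falls outside the subject region. The PyTorch fragment below is a minimal sketch of that idea under my own assumptions (the tensor shapes, the normalization, and the source of the subject mask are illustrative choices, not the authors' implementation):

```python
import torch

def subject_attention_loss(attn_map: torch.Tensor, subject_mask: torch.Tensor) -> torch.Tensor:
    """attn_map: (B, H*W) cross-attention weights received by the subject token.
    subject_mask: (B, H*W) binary mask, 1 inside the subject region.
    Returns a scalar that is small when attention concentrates on the subject."""
    attn = attn_map / (attn_map.sum(dim=-1, keepdim=True) + 1e-8)  # normalize per sample
    inside = (attn * subject_mask).sum(dim=-1)                     # attention mass on the subject
    return (1.0 - inside).mean()                                   # push the mass inside the mask

if __name__ == "__main__":
    B, H, W = 2, 16, 16
    attn = torch.rand(B, H * W, requires_grad=True)
    mask = torch.zeros(B, H * W)
    mask[:, : H * W // 4] = 1.0          # pretend the subject occupies the top quarter of the map
    loss = subject_attention_loss(attn, mask)
    loss.backward()
    print(float(loss))
```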
DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-15 | DOI: 10.1109/TCSVT.2025.3588516
Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang
Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state-of-the-art coding performance.
{"title":"DRLN: Disparity-Aware Rescaling Learning Network for Multi-View Video Coding Optimization","authors":"Shiwei Wang;Liquan Shen;Peiying Wu;Zhaoyi Tian;Feifeng Wang","doi":"10.1109/TCSVT.2025.3588516","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588516","url":null,"abstract":"Efficient compression of multi-view video data is a critical challenge for various applications due to the large volume of data involved. Although multi-view video coding (MVC) has introduced inter-view prediction techniques to reduce video redundancies, further reduction can be achieved by encoding a subset of views at a lower resolution through asymmetric rescaling, achieving higher compression efficiency. However, existing network-based rescaling approaches are designed solely for single-viewpoint videos. These methods neglect inter-view characteristics inherent in multi-view videos, resulting in suboptimal performance. To address this issue, we first propose a Disparity-aware Rescaling Learning Network (DRLN) that integrates disparity-aware feature extraction and multi-resolution adaptive rescaling to enhance MVC efficiency by minimizing both self- and inter-view redundancies. On the one hand, during the encoding stage, our method leverages the non-local correlation of multi-view contexts and performs adaptive downscaling with an early-exit mechanism, resulting in substantial multi-view bitrate savings. On the other hand, during the decoding stage, a dynamic aggregation strategy is proposed to facilitate effective interaction with inter-view features, utilizing the inter-view and cross-scale information to reconstruct fine-grained multi-view videos. Extensive experiments show that our network achieves a significant 26.31% BD-Rate reduction compared to the 3D-HEVC standard baseline, offering state of-the-art coding performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12788-12801"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
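The asymmetric rescaling idea in the DRLN abstract — keep a base view at full resolution, encode the remaining views at a reduced resolution, and upscale them after decoding — can be illustrated with a toy PyTorch sketch. The 0.5 scale factor, bicubic resampling, and function names below are assumptions for illustration only; the actual network learns content-adaptive rescaling with an early-exit mechanism, which this sketch does not model:

```python
import torch
import torch.nn.functional as F

def rescale_views(views: torch.Tensor, base_idx: int = 0, scale: float = 0.5):
    """views: (V, C, H, W), one frame per view. The base view keeps its resolution,
    the auxiliary views are downscaled before being passed to the encoder."""
    out = []
    for v in range(views.shape[0]):
        frame = views[v:v + 1]
        if v != base_idx:
            frame = F.interpolate(frame, scale_factor=scale, mode="bicubic", align_corners=False)
        out.append(frame)
    return out

def restore_views(coded, base_idx: int = 0, size=(256, 256)):
    """Upscale the auxiliary views back to the original resolution after decoding."""
    return [f if i == base_idx else F.interpolate(f, size=size, mode="bicubic", align_corners=False)
            for i, f in enumerate(coded)]

if __name__ == "__main__":
    frames = torch.rand(3, 3, 256, 256)            # 3 views, RGB, 256x256
    coded = rescale_views(frames)                   # base view full-res, others half-res
    print([tuple(f.shape[-2:]) for f in coded])     # [(256, 256), (128, 128), (128, 128)]
    decoded = restore_views(coded)
    print([tuple(f.shape[-2:]) for f in decoded])   # all back to (256, 256)
```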
UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-15 | DOI: 10.1109/TCSVT.2025.3588406
Yating Liu;Yaowei Li;Xiangyuan Lan;Wenming Yang;Zimo Liu;Qingmin Liao
Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel Unified Parameter-Efficient Transfer Learning (PETL) method for Text-based Person Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% parameters. Code is available at https://github.com/Liu-Yating/UP-Person.
{"title":"UP-Person: Unified Parameter-Efficient Transfer Learning for Text-Based Person Retrieval","authors":"Yating Liu;Yaowei Li;Xiangyuan Lan;Wenming Yang;Zimo Liu;Qingmin Liao","doi":"10.1109/TCSVT.2025.3588406","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588406","url":null,"abstract":"Text-based Person Retrieval (TPR) as a multi-modal task, which aims to retrieve the target person from a pool of candidate images given a text description, has recently garnered considerable attention due to the progress of contrastive visual-language pre-trained model. Prior works leverage pre-trained CLIP to extract person visual and textual features and fully fine-tune the entire network, which have shown notable performance improvements compared to uni-modal pre-training models. However, full-tuning a large model is prone to overfitting and hinders the generalization ability. In this paper, we propose a novel <italic>U</i>nified <italic>P</i>arameter-Efficient Transfer Learning (PETL) method for Text-based <italic>Person</i> Retrieval (UP-Person) to thoroughly transfer the multi-modal knowledge from CLIP. Specifically, UP-Person simultaneously integrates three lightweight PETL components including Prefix, LoRA and Adapter, where Prefix and LoRA are devised together to mine local information with task-specific information prompts, and Adapter is designed to adjust global feature representations. Additionally, two vanilla submodules are optimized to adapt to the unified architecture of TPR. For one thing, S-Prefix is proposed to boost attention of prefix and enhance the gradient propagation of prefix tokens, which improves the flexibility and performance of the vanilla prefix. For another thing, L-Adapter is designed in parallel with layer normalization to adjust the overall distribution, which can resolve conflicts caused by overlap and interaction among multiple submodules. Extensive experimental results demonstrate that our UP-Person achieves state-of-the-art results across various person retrieval datasets, including CUHK-PEDES, ICFG-PEDES and RSTPReid while merely fine-tuning 4.7% parameters. Code is available at <uri>https://github.com/Liu-Yating/UP-Person</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12874-12889"},"PeriodicalIF":11.1,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
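As a rough illustration of the parameter-efficient components named in the UP-Person abstract, the sketch below wraps a frozen linear layer with a LoRA update and applies a bottleneck adapter in parallel with layer normalization, which is the general flavor of the L-Adapter described above. This is a generic PETL sketch under assumed ranks and dimensions, not the paper's S-Prefix or L-Adapter code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # the pre-trained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

class ParallelAdapter(nn.Module):
    """A bottleneck adapter applied in parallel with layer normalization."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.norm(x) + self.up(torch.relu(self.down(x)))

if __name__ == "__main__":
    x = torch.rand(4, 77, 512)                           # (batch, tokens, hidden dim)
    layer = LoRALinear(nn.Linear(512, 512))
    adapter = ParallelAdapter(512)
    y = adapter(layer(x))
    trainable = sum(p.numel() for p in list(layer.parameters()) + list(adapter.parameters())
                    if p.requires_grad)
    print(y.shape, trainable)                            # only the LoRA/adapter weights train
```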
BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization
IF 11.1 | CAS Tier 1 (Engineering & Technology) | Q1 ENGINEERING, ELECTRICAL & ELECTRONIC | Pub Date: 2025-07-14 | DOI: 10.1109/TCSVT.2025.3588710
Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
{"title":"BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization","authors":"Jinglin Xu;Yaqi Zhang;Wenhao Zhou;Hongmin Liu","doi":"10.1109/TCSVT.2025.3588710","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3588710","url":null,"abstract":"Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 12","pages":"12707-12718"},"PeriodicalIF":11.1,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
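The bidirectional feature splitting described in the BFSTAL abstract separates, at each timestamp, a past (forward) summary from a future (backward) summary of the clip. The toy sketch below uses cumulative means as stand-ins for the Bi-MHSA/Bi-SSM/Bi-CONV branches; the shapes and the pooling choice are my assumptions, not the paper's modules:

```python
import torch

def bidirectional_split(features: torch.Tensor) -> torch.Tensor:
    """features: (B, T, D) per-frame video features.
    Returns (B, T, 2*D): for each timestamp t, a summary of frames <= t (past)
    concatenated with a summary of frames >= t (future)."""
    B, T, D = features.shape
    counts = torch.arange(1, T + 1, device=features.device).view(1, T, 1).float()
    past = features.cumsum(dim=1) / counts                             # running mean over the past
    future = features.flip(1).cumsum(dim=1).flip(1) / counts.flip(1)   # running mean over the future
    return torch.cat([past, future], dim=-1)

if __name__ == "__main__":
    feats = torch.rand(2, 10, 32)        # 2 clips, 10 frames, 32-dim features
    split = bidirectional_split(feats)
    print(split.shape)                    # torch.Size([2, 10, 64])
```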