
IEEE Transactions on Circuits and Systems for Video Technology: Latest Publications

EDTformer: An Efficient Decoder Transformer for Visual Place Recognition
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-09 DOI: 10.1109/TCSVT.2025.3559084
Tong Jin;Feng Lu;Shuyu Hu;Chun Yuan;Yunpeng Liu
Visual place recognition (VPR) aims to determine the general geographical location of a query image by retrieving visually similar images from a large geo-tagged database. To obtain a global representation for each place image, most approaches focus on aggregating deep features extracted from a backbone using prominent architectures (e.g., CNNs, MLPs, pooling layers, and transformer encoders), giving little attention to the transformer decoder. However, we argue that its strong capability to capture contextual dependencies and generate accurate features holds considerable potential for the VPR task. To this end, we propose an Efficient Decoder Transformer (EDTformer) for feature aggregation, which consists of several stacked simplified decoder blocks followed by two linear layers to directly produce robust and discriminative global representations. Specifically, we formulate deep features as the keys and values, and a set of learnable parameters as the queries. EDTformer fully utilizes the contextual information within deep features, then gradually decodes and aggregates the effective features into the learnable queries to output the global representations. Moreover, to provide more powerful deep features for EDTformer and further improve robustness, we use the foundation model DINOv2 as the backbone and propose a Low-rank Parallel Adaptation (LoPA) method to enhance its performance in VPR, which refines the intermediate features of the backbone progressively in a memory- and parameter-efficient way. As a result, our method not only outperforms single-stage VPR methods on multiple benchmark datasets, but also outperforms two-stage VPR methods that add a re-ranking stage at considerable cost. Code will be available at https://github.com/Tong-Jin01/EDTformer.
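As a rough illustration of the aggregation idea (learnable queries attending to backbone features as keys and values), the following NumPy sketch implements a single-head, projection-free cross-attention step. The shapes, the number of queries, and the omission of the stacked decoder blocks and final linear layers are simplifications for illustration, not the authors' exact architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_aggregate(features, queries):
    """Aggregate N patch features (N, D) into Q global slots (Q, D):
    learnable queries attend to the deep features as keys and values."""
    d = features.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d), axis=-1)  # (Q, N)
    return attn @ features                                      # (Q, D)

rng = np.random.default_rng(0)
feats = rng.standard_normal((196, 64))    # e.g. backbone patch tokens (assumed shape)
queries = rng.standard_normal((4, 64))    # learnable query slots (assumed count)
global_desc = decoder_aggregate(feats, queries).reshape(-1)  # one global descriptor
```

In the actual method the queries are trained parameters and several such blocks are stacked; here they are random, so the output only demonstrates the data flow.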
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8835-8848, 2025.
Citations: 0
Efficient Single-Server Private Inference Outsourcing for Convolutional Neural Networks
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-09 DOI: 10.1109/TCSVT.2025.3559101
Xuanang Yang;Jing Chen;Yuqing Li;Kun He;Xiaojie Huang;Zikuan Jiang;Ruiying Du;Hao Bai
Private inference outsourcing ensures the privacy of both clients and model owners when model owners deliver inference services to clients through third-party cloud servers. Existing solutions either reduce inference accuracy due to model approximations or rely on the unrealistic assumption of non-colluding servers. Moreover, their efficiency falls short of HELiKs, a solution focused solely on client privacy protection. In this paper, we propose Skybolt, a single-server private inference outsourcing framework that achieves greater efficiency than HELiKs without resorting to model approximations. Skybolt is built upon efficient secure two-party computation protocols that safeguard the privacy of both clients and model owners. For the linear calculation protocol, we devise a ciphertext packing algorithm for homomorphic matrix multiplication, effectively reducing both computational and communication overheads. Additionally, our nonlinear calculation protocol features a lightweight online phase, involving only addition and multiplication on secret shares. This stands in contrast to existing protocols, which entail resource-intensive techniques such as oblivious transfer. Extensive experiments on popular models, including ResNet50 and DenseNet121, show that Skybolt achieves a 5.4-7.3× reduction in inference latency and a 20.1-39.6× decrease in communication cost compared to HELiKs.
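The "addition and multiplication on secret shares" that the lightweight online phase relies on can be sketched with plain two-party additive secret sharing and a dealer-generated Beaver triple. This is a generic textbook building block shown for illustration, not Skybolt's actual protocol; the modulus choice and the trusted dealer are assumptions.

```python
import numpy as np

P = 2**61 - 1  # prime modulus for additive sharing (illustrative choice)
rng = np.random.default_rng(1)

def share(x):
    # Split x into two additive shares: x = s0 + s1 (mod P).
    s0 = int(rng.integers(0, P))
    return s0, (x - s0) % P

def reconstruct(s0, s1):
    return (s0 + s1) % P

def add_shares(a, b):
    # Addition is purely local: each party adds its own shares.
    return (a[0] + b[0]) % P, (a[1] + b[1]) % P

def beaver_mul(x_sh, y_sh):
    # Multiply shared values using a dealer-generated Beaver triple (a, b, c = ab).
    a, b = int(rng.integers(0, P)), int(rng.integers(0, P))
    a_sh, b_sh, c_sh = share(a), share(b), share(a * b % P)
    # Parties open e = x - a and f = y - b; in a real protocol each party
    # exchanges its masked share, which is the only online interaction.
    e = (reconstruct(*x_sh) - a) % P
    f = (reconstruct(*y_sh) - b) % P
    # z = c + e*b + f*a + e*f reconstructs to x*y.
    z0 = (c_sh[0] + e * b_sh[0] + f * a_sh[0] + e * f) % P
    z1 = (c_sh[1] + e * b_sh[1] + f * a_sh[1]) % P
    return z0, z1
```

Expanding z = ab + (x-a)b + (y-b)a + (x-a)(y-b) = xy confirms correctness; the online work is a few local additions and multiplications, matching the kind of cost profile the abstract describes.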
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 10, pp. 10586-10598, 2025.
Citations: 0
Spatio-Temporal Pyramid Keypoint Detection With Event Cameras
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-09 DOI: 10.1109/TCSVT.2025.3559299
Yuqing Zhu;Yuan Gao;Tianle Ding;Xiang Liu;Wenfei Yang;Tianzhu Zhang
Event cameras are bio-inspired sensors with diverse advantages, including high temporal resolution and minimal power consumption, and thus enjoy a wide range of applications in computer vision, among which event keypoint detection plays a vital role. However, repeatable event keypoint detection remains challenging because the lack of temporal inter-frame interaction leads to descriptors with limited temporal consistency, which restricts the ability to perceive keypoint motion. Besides, detectors learned from single-scale features are not suitable for event keypoints with significant motion-speed differences in high-speed scenarios. To deal with these problems, we propose a novel Spatio-Temporal Pyramid Keypoint Detection Network (STPNet) for event cameras, built on a temporally consistent descriptor learning (TCL) module and a spatially diverse detector learning (SDL) module. The proposed STPNet enjoys several merits. First, the TCL module generates temporally consistent descriptors for specific keypoint motion patterns. Second, the SDL module produces spatially diverse detectors for applications in high-speed motion scenarios. Extensive experimental results on three challenging benchmarks show that our method notably outperforms state-of-the-art event keypoint detection methods. Specifically, STPNet outperforms the best event keypoint detection method by 0.21 px in reprojection error on Event-Camera, 4% in IoU on N-Caltech101, 0.13 px in reprojection error on HVGA ATIS Corner, and 5.94% in matching accuracy on DSEC.
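The abstract does not specify STPNet's input encoding, but event streams are commonly rasterized into time-binned grids before any frame-like feature extraction. A minimal sketch of that standard preprocessing step follows; the (x, y, t, polarity) layout and the signed accumulation are assumptions for illustration, not the paper's exact representation.

```python
import numpy as np

def events_to_time_bins(events, H, W, n_bins):
    """Accumulate events, given as rows of (x, y, t, polarity), into n_bins
    signed time slices of shape (n_bins, H, W)."""
    x, y, t, p = (events[:, i] for i in range(4))
    t0, t1 = t.min(), t.max()
    # Map each timestamp to a bin index in [0, n_bins).
    b = np.minimum(((t - t0) / max(t1 - t0, 1e-9) * n_bins).astype(int), n_bins - 1)
    grid = np.zeros((n_bins, H, W))
    # Unbuffered accumulation so repeated pixels sum correctly.
    np.add.at(grid, (b, y.astype(int), x.astype(int)), np.where(p > 0, 1.0, -1.0))
    return grid
```

Stacking such slices at multiple temporal resolutions is one natural way to obtain the "spatio-temporal pyramid" the title refers to.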
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9384-9397, 2025.
Citations: 0
Matryoshka Learning With Metric Transfer for Image-Text Matching
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-08 DOI: 10.1109/TCSVT.2025.3558996
Pengzhe Wang;Lei Zhang;Zhendong Mao;Nenan Lyu;Yongdong Zhang
Image-text matching is a significant technology for vision-language tasks, as it bridges the semantic gap between the visual and text modalities. Although existing methods have achieved remarkable progress, high-dimensional embeddings or ensemble methods are often used to achieve sufficiently good recall or accuracy, which significantly increases the computational and storage costs in practical applications. Knowledge distillation can help achieve resource-efficient deployment; however, existing techniques are not directly applicable to cross-modal matching scenarios. The main difficulties arise from two aspects: 1) the distillation from teacher model to student model is usually conducted in two separate stages, and this inconsistency in learning objectives may lead to sub-optimal compression results; 2) distilling knowledge from each modality independently cannot ensure the preservation of the cross-modal alignment established in the original embeddings, which can lead to the compressed embeddings failing to achieve accurate alignment. To address these issues, we propose a novel Matryoshka Learning with Metric Transfer framework (MAMET) for image-text matching. After capturing multi-granularity information through multiple high-dimensional embeddings, we propose an efficient Matryoshka training process with a shared backbone to compress the different granularity information into a low-dimensional embedding, integrating cross-modal matching and knowledge distillation in a single stage. Meanwhile, a novel metric transfer criterion aligns the metric relations across embedding spaces of different dimensions and modalities, ensuring good cross-modal alignment after distillation. In this way, MAMET transfers the strong representation and generalization capability of high-dimensional ensemble models to a basic network, which not only delivers a large performance boost but also introduces no extra overhead during online inference. Extensive experiments on benchmark datasets demonstrate the effectiveness and efficiency of MAMET, which consistently achieves an average 2%-20% performance improvement over state-of-the-art methods across various backbones and domains.
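The Matryoshka idea of packing multi-granularity information into one vector can be illustrated by scoring nested prefixes of the same embedding: a short prefix gives cheap retrieval, the full vector gives the most accurate score. The dimensions below are arbitrary, and this sketch shows only the inference-time property, not MAMET's training procedure or metric transfer criterion.

```python
import numpy as np

def l2norm(v, axis=-1):
    # Normalize so that a dot product becomes cosine similarity.
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def matryoshka_scores(img_emb, txt_emb, dims=(8, 16, 32)):
    """Cosine similarity computed on nested prefixes of one embedding pair.
    Each prefix is re-normalized, so every dimension budget yields a valid score."""
    return {d: float(l2norm(img_emb[:d]) @ l2norm(txt_emb[:d])) for d in dims}
```

A deployment can then pick the prefix length that meets its latency budget without storing separate embeddings per dimension.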
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9502-9516, 2025.
Citations: 0
Coarse-to-Fine Hypergraph Network for Spatiotemporal Action Detection
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-08 DOI: 10.1109/TCSVT.2025.3558939
Ping Li;Xingchao Ye;Lingfeng He
Spatiotemporal action detection localizes action instances along both spatial and temporal dimensions by identifying the action start time and end time, the action class, and object (e.g., actor) bounding boxes. It faces two primary challenges: 1) varying durations of actions and inconsistent tempo of action instances within the same class, and 2) modeling complex object interactions, neither of which is well handled by previous methods. For the former, we develop the coarse-to-fine attention module, which employs an efficient dynamic time warping to make a coarse estimation of action frames by eliminating context-agnostic features, and further adopts the attention mechanism to capture the first-order object relations within those action frames. This yields finer-grained action estimation. For the latter, we design ternary high-order hypergraph neural networks, which model the spatial relation, the motion dynamics, and the high-order relations of different objects across frames. This encourages positive relations among objects within the same action while suppressing negative relations among those in different actions. Therefore, we present a Coarse-to-Fine Hypergraph Network, abbreviated as CFHN, for spatiotemporal action detection, which considers the object local context, the first-order object relations, and the high-order object relations together. It combines the spatiotemporal first-order and high-order features along the channel dimension to obtain satisfactory detection results. Extensive experiments on several benchmarks, including AVA, JHMDB-21, and UCF101-24, demonstrate the superiority of the proposed approach.
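Dynamic time warping, which the coarse-to-fine attention module uses to align sequences despite inconsistent tempo, can be sketched in its textbook form. The paper's "efficient" variant presumably adds pruning or banding not shown here, and the absolute-difference cost over 1-D sequences is an illustrative simplification of comparing per-frame features.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences.
    Allows one frame to match several frames in the other sequence,
    so sequences with the same shape but different tempo score as similar."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

For example, a sequence and a slowed-down copy of it warp to distance zero, whereas Euclidean frame-by-frame comparison would penalize the tempo difference.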
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 8653-8665, 2025.
Citations: 0
Structured Light Image Planar-Topography Feature Decomposition for Generalizable 3D Shape Measurement
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-08 DOI: 10.1109/TCSVT.2025.3558732
Mingyang Lei;Jingfan Fan;Long Shao;Hong Song;Deqiang Xiao;Danni Ai;Tianyu Fu;Yucong Lin;Ying Gu;Jian Yang
The application of structured light (SL) techniques has achieved remarkable success in three-dimensional (3D) measurements. Traditional methods generally calculate SL information pixel by pixel to obtain the measurement results. Recently, the rise of deep learning (DL) has led to significant developments in this task. However, existing DL-based methods generally learn all features within the image in an end-to-end manner, ignoring the distinction between SL and non-SL information. Therefore, these methods may encounter difficulties in focusing on subtle variations in SL patterns across different scenes, thereby degrading measurement precision. To overcome this challenge, we propose a novel SL Image Planar-Topography Feature Decomposition Network (SIDNet). To fully utilize the information from different SL modality images (fringe and speckle), we decompose different modalities into topography features (modality-specific) and planar features (modality-shared). A physics-driven decomposition loss is proposed to make the topography/planar features dissimilar/similar, which guides the network to distinguish between SL and non-SL information. Moreover, to obtain modality-fused features with global overview and local detail information, we propose a wrapped phase-driven feature fusion module. Specifically, a novel Tri-modality Mamba block is designed to integrate different sources with the guidance of the wrapped phase features. Extensive experiments demonstrate the superiority of our SIDNet in multiple simulated 3D measurement scenes. Moreover, our method shows better generalization ability than other DL models and can be directly applicable to unseen real-world scenes.
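The physics-driven decomposition loss, which makes the modality-shared planar features similar and the modality-specific topography features dissimilar, admits a simple cosine-similarity formulation. The exact loss in the paper may differ, so treat this as an assumed sketch over the fringe (f) and speckle (s) branch features.

```python
import numpy as np

def cos(u, v):
    # Cosine similarity between two feature vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def decomposition_loss(planar_f, planar_s, topo_f, topo_s):
    """Pull the modality-shared planar features of the fringe and speckle
    branches together (similarity toward 1), and push their modality-specific
    topography features apart (clip negative similarity so orthogonal is enough)."""
    pull = 1.0 - cos(planar_f, planar_s)
    push = max(cos(topo_f, topo_s), 0.0)
    return pull + push
```

The loss is zero exactly when the planar features coincide in direction and the topography features are at most orthogonal, which is the behavior the abstract describes as guiding the network to separate SL from non-SL information.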
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9517-9529, 2025.
Citations: 0
BLENet: A Bio-Inspired Lightweight and Efficient Network for Left Ventricle Segmentation in Echocardiography
IF 11.1 CAS Tier 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-07 DOI: 10.1109/TCSVT.2025.3558496
Xintao Pang;Fengjuan Yao;Yanming Zhang;Yue Sun;Edmundo Patricio Lopes Lao;Chuan Lin;Patrick Cheong-Iao Pang;Wei Wang;Wei Li;Zhifan Gao;Tao Tan
In echocardiography, accurate segmentation of the left ventricle at end-diastole (ED) and end-systole (ES) is crucial for quantitative assessment of left ventricular ejection fraction. However, echocardiography is a dynamic imaging modality that requires real-time analysis and is frequently performed with portable devices in varied clinical settings; this challenges mainstream approaches, which enhance model performance primarily by increasing parameter counts and computational cost while lacking targeted optimization for these characteristics. To address these challenges, we propose BLENet, a lightweight segmentation model inspired by biological vision mechanisms. By integrating key mechanisms from biological vision systems with medical image features, our model achieves efficient and accurate segmentation. Specifically, the center-surround antagonism of retinal ganglion cells and the lateral geniculate nucleus exhibits high sensitivity to contrast variations, corresponding to the distinct contrast between the ventricular chamber (hypoechoic) and myocardial wall (hyperechoic) in ultrasound images. Based on this, we designed an antagonistic module to enhance feature extraction in target regions. Subsequently, the directional selectivity mechanism in the V1 cortex aligns with the variable directional features of the ventricular boundary, inspiring our direction-selective module to improve segmentation accuracy. Finally, we introduce an adaptive wavelet fusion module in the decoding network to address the limited receptive field of convolutions and enhance feature integration in cardiac ultrasound. Experiments demonstrate that our model contains only 0.16M parameters and requires no pre-training. On the CAMUS dataset, it achieves Dice coefficients of 0.951 and 0.927 for the ED and ES phases respectively, while on the EchoNet-Dynamic dataset it achieves 0.933 and 0.909, with an inference speed of 112 FPS on an NVIDIA RTX 2080 Ti.
Evaluation on an external clinical dataset indicates our model’s promising generalization and potential for clinical application.
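The center-surround antagonism the abstract builds on can be sketched as a Difference-of-Gaussians (DoG) filter: a narrow "center" Gaussian minus a wide "surround" Gaussian responds strongly at contrast boundaries (such as the hypoechoic chamber against the hyperechoic wall) and is near zero inside uniform regions. The 1-D setting, kernel widths, and the synthetic intensity profile below are illustrative assumptions, not BLENet's actual module design.

```python
import math

def gaussian_kernel(sigma, radius):
    # Normalized 1-D Gaussian weights over offsets -radius..radius.
    k = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve(signal, kernel):
    # Direct convolution with clamped (replicate) borders.
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - r, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def dog_response(signal, sigma_c=1.0, sigma_s=3.0, radius=9):
    # Center-surround antagonism: narrow center minus wide surround.
    center = convolve(signal, gaussian_kernel(sigma_c, radius))
    surround = convolve(signal, gaussian_kernel(sigma_s, radius))
    return [c - s for c, s in zip(center, surround)]

# Synthetic intensity profile: dark chamber (0.1) next to bright wall (0.9).
profile = [0.1] * 20 + [0.9] * 20
resp = dog_response(profile)
peak = max(abs(v) for v in resp)   # largest response, near the boundary
interior = abs(resp[5])            # response deep inside the uniform region
```

Here `peak` sits at the chamber/wall boundary while `interior` is essentially zero, which is the contrast sensitivity the antagonistic module is meant to exploit.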
BLENet: A Bio-Inspired Lightweight and Efficient Network for Left Ventricle Segmentation in Echocardiography. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9218–9233. DOI: 10.1109/TCSVT.2025.3558496.
Citations: 0
Edge Approximation Text Detector
IF 11.1 CAS Zone 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-07 DOI: 10.1109/TCSVT.2025.3558634
Chuang Yang;Xu Han;Tao Han;Han Han;Bingxuan Zhao;Qi Wang
Pursuing efficient text shape representations helps scene text detection models focus on compact foreground regions and optimize the contour reconstruction steps, simplifying the whole detection pipeline. Current approaches either represent irregular shapes via a box-to-polygon strategy or decompose a contour into pieces that are fitted gradually; as a result, these models suffer from coarse contours or complex pipelines. Considering these issues, we introduce EdgeText to fit text contours compactly while avoiding excessive contour rebuilding. Concretely, we observe that the two long edges of a text instance can be regarded as smooth curves. This allows us to build contours from continuous, smooth edges that tightly cover text regions rather than fitting them piecewise, which avoids both limitations of current models. Inspired by this observation, EdgeText formulates text representation as an edge approximation problem solved with parameterized curve-fitting functions. In the inference stage, our model first locates text centers and then creates curve functions that approximate the text edges based on these points. Meanwhile, truncation points are determined from the location features. Finally, curve segments are extracted from the curve functions using the pixel coordinates of the truncation points to reconstruct the text contours. Furthermore, considering EdgeText's deep dependency on text edges, a bilateral enhanced perception (BEP) module is designed to encourage the model to attend to edge features. Additionally, to accelerate the learning of the curve function parameters, we introduce a proportional integral loss (PI-loss) that forces the model to focus on the curve distribution and avoid being disturbed by text scales. Ablation experiments demonstrate that EdgeText fits scene texts compactly and naturally. Comparisons show that EdgeText is superior to existing methods on multiple public datasets. Code is available at https://github.com/omtcyang/EdgeTD.
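The edge-approximation idea — fit a smooth parametric curve to edge points, then keep only the segment between two truncation points — can be sketched with a least-squares quadratic. The curve family, the helper names (`fit_quadratic`, `sample_segment`), and the sample coordinates are illustrative assumptions; the abstract does not specify EdgeText's actual parameterization.

```python
def fit_quadratic(pts):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    sx = [sum(x ** k for x, _ in pts) for k in range(5)]      # moment sums
    sy = [sum(y * x ** k for x, y in pts) for k in range(3)]
    A = [[sx[i + j] for j in range(3)] for i in range(3)]
    b = list(sy)
    for col in range(3):                     # Gaussian elimination, partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):                      # back substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 3))) / A[r][r]
    return coef

def sample_segment(coef, x_start, x_end, n=10):
    """Sample the fitted curve only between the two truncation x-coordinates."""
    c0, c1, c2 = coef
    step = (x_end - x_start) / (n - 1)
    return [(x_start + i * step,
             c0 + c1 * (x_start + i * step) + c2 * (x_start + i * step) ** 2)
            for i in range(n)]

# Edge samples lying on y = 2 + 0.5x - 0.01x^2 (a gently curved "long edge").
edge_pts = [(x, 2 + 0.5 * x - 0.01 * x * x) for x in range(0, 50, 5)]
coef = fit_quadratic(edge_pts)
segment = sample_segment(coef, 10, 40)  # truncation points assumed at x=10, x=40
```

Fitting one continuous function per long edge, then truncating it, is what lets such a pipeline avoid piecewise contour assembly.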
Edge Approximation Text Detector. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9234–9245. DOI: 10.1109/TCSVT.2025.3558634.
Citations: 0
CMNet: Cross-Modal Coarse-to-Fine Network for Point Cloud Completion Based on Patches
IF 11.1 CAS Zone 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557842
Zhenjiang Du;Zhitao Liu;Guan Wang;Jiwei Wei;Sophyani Banaamwini Yussif;Zheng Wang;Ning Xie;Yang Yang
Point clouds serve as the foundational representation of 3D objects, playing a pivotal role in both computer vision and computer graphics. Recently, acquiring point clouds has become effortless owing to advances in hardware devices. However, collected point clouds may be incomplete due to environmental conditions such as occlusion, so completing partial point clouds is an essential task. Most current methods address point cloud completion by exploiting shape priors. While these methods have demonstrated commendable performance, they often struggle to preserve the global structure and geometric details of the 3D shape. In contrast, we propose a novel cross-modal coarse-to-fine network (CMNet) for point cloud completion. Our method uses additional image information to provide global context, avoiding loss of structure. To ensure that the generated results contain sufficient geometric detail, we propose a coarse-to-fine learning approach based on multiple patches. Specifically, we encode the image and use multiple generators to produce multiple coarse patches, which are combined into a complete shape. Subsequently, based on these coarse patches, we generate fine patches by incorporating partial point cloud information. Experimental results show that our method achieves state-of-the-art performance on point cloud completion.
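Completion quality in this area is commonly evaluated with the Chamfer distance, the symmetric average of nearest-neighbor distances between two point sets; the abstract itself does not name a metric, so this is a standard-practice assumption rather than the paper's stated protocol. A minimal O(|P|·|Q|) sketch:

```python
def chamfer_distance(P, Q):
    """Symmetric Chamfer distance: mean squared nearest-neighbor
    distance from P to Q plus the same from Q to P."""
    def sq(a, b):
        # squared Euclidean distance between two 3D points
        return sum((x - y) ** 2 for x, y in zip(a, b))
    d_pq = sum(min(sq(p, q) for q in Q) for p in P) / len(P)
    d_qp = sum(min(sq(q, p) for p in P) for q in Q) / len(Q)
    return d_pq + d_qp

# Toy example: a complete unit square of points vs. an "occluded" half.
complete = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
partial = complete[:2]
cd = chamfer_distance(partial, complete)  # penalizes the missing half
```

A lower value means the predicted cloud both covers and stays close to the ground truth, which is why the metric rewards recovering occluded regions.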
CMNet: Cross-Modal Coarse-to-Fine Network for Point Cloud Completion Based on Patches. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9132–9147. DOI: 10.1109/TCSVT.2025.3557842.
Citations: 0
IEEE Circuits and Systems Society Information
IF 8.3 CAS Zone 1 (Engineering & Technology) Q1 ENGINEERING, ELECTRICAL & ELECTRONIC Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3547204
IEEE Circuits and Systems Society Information. IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 4, p. C3. DOI: 10.1109/TCSVT.2025.3547204.
Citations: 0