
Latest Publications in IEEE Transactions on Multimedia

DBSR: Quadratic Conditional Diffusion Model for Blind Cardiac MRI Super-Resolution
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11358-11371 | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453059
Defu Qiu;Yuhu Cheng;Kelvin K.L. Wong;Wenjun Zhang;Zhang Yi;Xuesong Wang
Cardiac magnetic resonance imaging (CMRI) helps experts diagnose cardiovascular diseases quickly. Because the patient breathes and moves slightly during the scan, the acquired CMRI may be severely blurred, degrading the accuracy of clinical diagnosis. To address this issue, we propose DBSR, a quadratic conditional diffusion model for blind CMRI super-resolution. Specifically, we propose a conditional blur-kernel noise predictor that uses a diffusion model to predict the blur kernel from low-resolution images, transforming the unknown blur kernel of low-resolution CMRI into a known one. Meanwhile, we design a novel conditional CMRI noise predictor that takes the predicted blur kernel as prior knowledge to guide the diffusion model in reconstructing high-resolution CMRI. Furthermore, we propose a cascaded residual attention network feature extractor that extracts features from low-resolution CMRI for both blur-kernel prediction and SR reconstruction. Extensive experimental results indicate that DBSR achieves better blind super-resolution reconstruction than several state-of-the-art baselines.
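A minimal PyTorch sketch of the two-stage conditioning described above: a kernel-noise predictor conditioned on the low-resolution image, followed by an image-noise predictor conditioned on both the LR image and the predicted kernel. Module names, layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class KernelNoisePredictor(nn.Module):
    """Stage 1: predict diffusion noise on the blur kernel, conditioned on LR."""
    def __init__(self, kernel_size=21):
        super().__init__()
        self.lr_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # (B, 16) global LR code
        )
        self.mlp = nn.Sequential(
            nn.Linear(kernel_size * kernel_size + 16 + 1, 128), nn.ReLU(),
            nn.Linear(128, kernel_size * kernel_size),
        )

    def forward(self, noisy_kernel, lr, t):
        # noisy_kernel: (B, k, k); lr: (B, 1, h, w); t: (B, 1) diffusion timestep
        cond = self.lr_encoder(lr)
        x = torch.cat([noisy_kernel.flatten(1), cond, t], dim=1)
        return self.mlp(x)

class ImageNoisePredictor(nn.Module):
    """Stage 2: predict diffusion noise on the HR image, conditioned on the
    upsampled LR image and the kernel recovered by stage 1."""
    def __init__(self, kernel_size=21):
        super().__init__()
        self.kernel_embed = nn.Linear(kernel_size * kernel_size, 16)
        self.net = nn.Sequential(
            nn.Conv2d(1 + 1 + 16 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, noisy_hr, lr_up, kernel, t):
        B, _, H, W = noisy_hr.shape
        k = self.kernel_embed(kernel.flatten(1)).view(B, 16, 1, 1).expand(B, 16, H, W)
        ts = t.view(B, 1, 1, 1).expand(B, 1, H, W)
        return self.net(torch.cat([noisy_hr, lr_up, k, ts], dim=1))
```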
Citations: 0
LFS-Aware Surface Reconstruction From Unoriented 3D Point Clouds
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11415-11427 | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453050
Rao Fu;Kai Hormann;Pierre Alliez
We present a novel approach for generating isotropic surface triangle meshes directly from unoriented 3D point clouds, with the mesh density adapting to the estimated local feature size (LFS). Popular reconstruction pipelines first reconstruct a dense mesh from the input point cloud and then apply remeshing to obtain an isotropic mesh; this sequential pipeline makes it hard to obtain a lower-density mesh while preserving detail. Instead, our approach reconstructs both an implicit function and an LFS-aware mesh sizing function directly from the input point cloud, which are then used to produce the final LFS-aware mesh without remeshing. We combine local curvature radius and shape diameter to estimate the LFS directly from the input point clouds. Additionally, we propose a new mesh solver for an implicit function whose zero level set delineates the surface without requiring normal orientation. The added value of our approach is generating isotropic meshes directly from 3D point clouds with an LFS-aware density, achieving a trade-off between geometric detail and mesh complexity. Our experiments also demonstrate the robustness of our method to noise, outliers, and missing data, and show that it preserves sharp features on CAD point clouds.
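As a toy illustration of what an LFS-driven sizing field looks like (the paper's estimator combines local curvature radius and shape diameter; this sketch approximates only a curvature-radius proxy, and all parameter choices are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_sizing(points, k=30, scale=1.0, min_size=None, max_size=None):
    """Per-point sizing field: PCA of each k-neighbourhood gives the patch
    extent and its deviation from the local tangent plane; their ratio acts
    as a curvature-radius proxy, so flat regions get coarse sizing."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    sizing = np.empty(len(points))
    for i, nbrs in enumerate(idx):
        nb = points[nbrs] - points[nbrs].mean(axis=0)
        evals = np.linalg.eigvalsh(nb.T @ nb / k)   # ascending eigenvalues
        flatness = np.sqrt(max(evals[0], 1e-12))    # off-plane deviation
        extent = np.sqrt(evals[2])                  # in-plane patch radius
        sizing[i] = scale * extent**2 / max(flatness, 1e-9)
    if min_size is not None or max_size is not None:
        sizing = np.clip(sizing, min_size, max_size)
    return sizing
```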
Citations: 0
Multi-Prior Driven Resolution Rescaling Blocks for Intra Frame Coding
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11274-11289 | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453033
Peiying Wu;Shiwei Wang;Liquan Shen;Feifeng Wang;Zhaoyi Tian;Xia Hua
Deep learning techniques are increasingly integrated into rescaling-based video compression frameworks and have shown great potential for improving compression efficiency. However, existing methods achieve limited performance because 1) they treat the context priors generated by the codec as independent sources of information, ignoring potential interactions between multiple priors during rescaling, which may not effectively facilitate compression; and 2) they often employ a uniform sampling ratio across regions of varying content complexity, resulting in the loss of important information. To address these two issues, this paper proposes a spatial multi-prior driven resolution rescaling framework for intra-frame coding, called MP-RRF, consisting of three sub-networks: a multi-prior driven network, a downscaling network, and an upscaling network. First, the multi-prior driven network employs complexity and similarity priors to smooth unnecessarily complicated information while leveraging similarity and quality priors to produce high-fidelity complementary information. This interaction of complexity, similarity, and quality priors ensures redundancy reduction and texture enhancement. Second, the downscaling network discriminatively processes components of different granularities to generate a compact, low-resolution image for encoding. The upscaling network aggregates a complementary set of contextual multi-scale features to reconstruct realistic details, combining variable receptive fields to suppress multi-scale compression artifacts and resampling noise. Extensive experiments show that our network achieves a significant 23.84% Bjøntegaard Delta Rate (BD-Rate) reduction under the all-intra configuration compared to the codec anchor, offering state-of-the-art coding performance.
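The BD-Rate figure quoted above is the standard Bjøntegaard metric: fit log-rate as a cubic polynomial of quality for each codec, then average the gap over the shared quality interval. A self-contained sketch of that standard calculation (not code from the paper); negative values mean rate savings at equal quality:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta-rate in percent between two rate-quality curves,
    each given as >= 4 (rate, PSNR) operating points."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))   # overlapping quality range
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a, int_t = np.polyint(p_a), np.polyint(p_t)
    avg_a = (np.polyval(int_a, hi) - np.polyval(int_a, lo)) / (hi - lo)
    avg_t = (np.polyval(int_t, hi) - np.polyval(int_t, lo)) / (hi - lo)
    return (np.exp(avg_t - avg_a) - 1) * 100
```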
Citations: 0
SMC-NCA: Semantic-Guided Multi-Level Contrast for Semi-Supervised Temporal Action Segmentation
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11386-11401 | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3452980
Feixiang Zhou;Zheheng Jiang;Huiyu Zhou;Xuelong Li
Semi-supervised temporal action segmentation (SS-TAS) aims to perform frame-wise classification in long untrimmed videos, where only a fraction of the videos in the training set have labels. Recent studies have shown the potential of contrastive learning for unsupervised representation learning on unlabelled data. However, learning the representation of each frame by unsupervised contrastive learning for action segmentation remains an open and challenging problem. In this paper, we propose a novel Semantic-guided Multi-level Contrast scheme with a Neighbourhood-Consistency-Aware unit (SMC-NCA) to extract strong frame-wise representations for SS-TAS. Specifically, for representation learning, SMC first explores intra- and inter-information variations in a unified and contrastive way, based on action-specific semantic information and temporal information highlighting relations between actions. Then the NCA module, which enforces spatial consistency between neighbourhoods centred at different frames to alleviate over-segmentation, works alongside SMC for semi-supervised learning (SSL). Our SMC outperforms other state-of-the-art methods on three benchmarks, offering improvements of up to 17.8% and 12.6% in terms of Edit distance and accuracy, respectively. Additionally, the NCA unit yields significantly better segmentation performance when only 5% of the videos are labelled. We also demonstrate the generalizability and effectiveness of the proposed method on our Parkinson's Disease Mouse Behaviour (PDMB) dataset.
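Contrastive schemes of this kind are typically built on an InfoNCE-style objective over frame embeddings; SMC's multi-level, semantic-guided variant adds structure on top of it. A generic sketch (names and the temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.07):
    """Pull each anchor frame embedding toward its positive and away from a
    shared pool of negatives. anchor/positive: (B, D); negatives: (K, D)."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos_logits = (a * p).sum(-1, keepdim=True) / tau   # (B, 1)
    neg_logits = a @ n.t() / tau                        # (B, K)
    logits = torch.cat([pos_logits, neg_logits], dim=1)
    targets = torch.zeros(a.size(0), dtype=torch.long, device=a.device)
    return F.cross_entropy(logits, targets)            # positive sits at index 0
```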
Citations: 0
Elaborate Teacher: Improved Semi-Supervised Object Detection With Rich Image Exploiting
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11345-11357 | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453040
Xi Yang;Qiubai Zhou;Ziyu Wei;Hong Liu;Nannan Wang;Xinbo Gao
Semi-Supervised Object Detection (SSOD) has shown remarkable results by leveraging image pairs within a teacher-student framework. A strong augmentation method can generate richer images and alleviate the influence of noise in pseudo-labels. However, existing data augmentation methods for SSOD do not consider instance-level information and thus cannot make full use of unlabeled data. Moreover, the current teacher-student framework in SSOD relies solely on pseudo-labeling techniques, which may disregard some uncertain information. In this article, we introduce a new method called Elaborate Teacher, which generates and exploits image pairs in a more refined manner. To enrich strongly augmented images, we propose a novel data augmentation method called Information-Aware Mixup Representation (IAMR). IAMR utilizes the teacher model's predictions as prior information and considers instance-level information, and it can be seamlessly integrated with existing SSOD data augmentation methods. Furthermore, to fully exploit the information in unlabeled data, we propose Enhanced Scale Consistency Regularization (ESCR), which enforces consistency in both semantic space and feature space. Elaborate Teacher thus pairs a fresh data augmentation method with consistency regularization, boosting the performance of semi-supervised object detectors. Extensive experiments on the PASCAL VOC and MS-COCO datasets demonstrate the effectiveness of our method in leveraging unlabeled image information. Our method consistently outperforms the baseline and improves mAP by 11.6% and 9.0% relative to the supervised baseline when using 5% and 10% of labeled data on MS-COCO, respectively.
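Two standard ingredients of this method family, sketched under stated assumptions: the exponential-moving-average teacher update common to Mean-Teacher-style SSOD, and a plain pixel-level mixup. IAMR itself additionally weights the mix with instance-level cues from teacher predictions, which is not reproduced here.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights follow the student as an exponential moving average."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def mixup(img_a, img_b, lam=0.5):
    """Plain pixel-level mixup of two augmented views (no instance weighting)."""
    return lam * img_a + (1.0 - lam) * img_b
```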
Citations: 0
Learning Discriminative Motion Models for Multiple Object Tracking
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11372-11385 | Pub Date: 2024-09-02 | DOI: 10.1109/TMM.2024.3453057
Yi-Fan Li;Hong-Bing Ji;Wen-Bo Zhang;Yu-Kun Lai
Motion models are vital for multiple object tracking (MOT): they make instance-level position predictions for targets so the tracker can handle occlusions and noisy detections. Recent methods have proposed using Single Object Tracking (SOT) techniques to build motion models, unifying the SOT tracker and the object detector into a single network for high-efficiency MOT. However, this paradigm ignores three feature incompatibility issues, leading to inferior performance. First, the object detector requires class-specific features to localize objects of pre-defined classes, whereas SOT requires target-specific features to track a target of interest whose category is unknown. Second, MOT relies on intra-class differences to associate targets of the same identity (ID), while SOT trackers focus on inter-class differences to distinguish the tracking target from the background. Third, classification confidence is used to determine the existence of targets, but it is obtained from category-related features and cannot accurately reveal whether a target exists in tracking scenes. To address these issues, we propose a novel Task-specific Feature Encoding Network (TFEN) to extract task-driven features for the different sub-networks. Besides, we propose a novel Quadruplet State Sampling (QSS) strategy to form the training samples of the motion model and guide the SOT trackers to capture identity-discriminative features in position predictions. Finally, we propose an Existence Aware Tracking (EAT) algorithm that estimates the existence confidence of targets and re-considers low-scored predictions to recover missed targets. Experimental results indicate that the proposed Discriminative Motion Model-based tracker (DMMTracker) can effectively address these issues when employing SOT trackers as motion models, leading to highly competitive results on MOT benchmarks.
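For contrast, the classical MOT motion model that SOT-based predictors aim to improve on is a constant-velocity Kalman predict step; a minimal sketch with an assumed state layout and noise scale:

```python
import numpy as np

def kalman_predict(x, P, dt=1.0, q=1e-2):
    """Predict step for a [cx, cy, vx, vy] track state x with covariance P
    under a constant-velocity model."""
    F = np.array([[1.0, 0.0,  dt, 0.0],
                  [0.0, 1.0, 0.0,  dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    Q = q * np.eye(4)                      # process noise
    return F @ x, F @ P @ F.T + Q
```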
Citations: 0
Language-Guided Dual-Modal Local Correspondence for Single Object Tracking
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 10637-10650 | Pub Date: 2024-08-29 | DOI: 10.1109/TMM.2024.3410141
Jun Yu;Zhongpeng Cai;Yihao Li;Lei Wang;Fang Gao;Ye Yu
This paper focuses on advancing single-object tracking in computer vision, which has broad applications including robotic vision, video surveillance, and sports video analysis. Methods that rely solely on the target's initial visual information hit performance bottlenecks and have limited applicability, because appearance features carry little target semantics and the target's appearance changes continuously. To address these issues, we propose a novel visual-language dual-modal single-object tracking approach that leverages natural language descriptions to enrich the semantic information of the moving target. We introduce a dual-modal single-object tracking algorithm based on local correspondence modeling. The algorithm decomposes visual features into multiple local visual semantic features and pairs them with local language features extracted from natural language descriptions. In addition, we propose a new global relocalization method that utilizes visual-language bimodal information to perceive target disappearance and misalignment and to adaptively reposition the target in the entire image. This improves the tracker's ability to adapt to changes in target appearance over long periods, enabling long-term single-target tracking based on bimodal semantic and motion information. Experimental results show that our model outperforms state-of-the-art methods, demonstrating the effectiveness and efficiency of our approach.
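Local visual-language correspondence of this kind is commonly realized as cross-attention from region tokens to word tokens; a minimal PyTorch sketch (dimensions and module layout are assumptions, not the paper's architecture):

```python
import torch.nn as nn

class LocalCorrespondence(nn.Module):
    """Each local visual token attends to the phrase tokens that describe it;
    the attention weights expose which words ground which region."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, D); text_tokens: (B, Nt, D)
        fused, weights = self.attn(visual_tokens, text_tokens, text_tokens)
        return fused, weights
```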
Citations: 0
MAAN: Memory-Augmented Auto-Regressive Network for Text-Driven 3D Indoor Scene Generation
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11057-11069 | Pub Date: 2024-08-26 | DOI: 10.1109/TMM.2024.3443657
Zhaoda Ye;Yang Liu;Yuxin Peng
The objective of text-driven 3D indoor scene generation is to automatically generate and arrange objects to form a 3D scene that accurately captures the semantics of a given text description. Existing approaches are mainly guided by specific object categories and room layout to generate and position objects such as furniture within 3D indoor scenes. However, few methods harness the potential of the text description to precisely control both spatial relationships and object combinations. Consequently, these methods lack a robust mechanism for determining the accurate object attributes needed to craft a plausible 3D scene whose spatial relationships stay consistent with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. First, we propose a memory-augmented network that helps the model decide the attributes of each object, such as 3D coordinates, rotation, and size, improving the consistency of object spatial relations with the text description. Our approach constructs a memory context to select relevant objects within the scene, providing spatial information that aids in generating the new object with correct attributes. Second, we develop a prior attribute prediction network that learns to generate a complete scene with suitable and reasonable object compositions. It adopts a pre-training strategy to extract composition priors from existing scenes, enabling multiple objects to be organized into a reasonable scene while keeping object relations consistent with the text description. We conduct experiments on three room types (bedroom, living room, and dining room) from the 3D-FRONT dataset. The results underscore the accuracy of our method in governing spatial relationships among objects, showcasing its superior flexibility compared to existing techniques.
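A minimal sketch of the memory-context idea: score the objects already placed in the scene against the text embedding, pool the relevant ones into a context vector, and predict the next object's attributes from that context. Dimensions, the attribute layout, and all names are assumptions:

```python
import torch
import torch.nn as nn

class MemoryContext(nn.Module):
    """Relevance-weighted pooling over placed-object embeddings, followed by
    an attribute head for the next object (e.g. position, rotation, size)."""
    def __init__(self, dim=128, n_attrs=8):
        super().__init__()
        self.score = nn.Linear(dim * 2, 1)
        self.head = nn.Linear(dim * 2, n_attrs)

    def forward(self, placed, query):
        # placed: (N, D) embeddings of objects already in the scene
        # query:  (D,)   embedding of the text description
        q = query.expand(placed.size(0), -1)
        w = torch.softmax(self.score(torch.cat([placed, q], dim=-1)), dim=0)
        ctx = (w * placed).sum(dim=0)                 # memory context vector
        return self.head(torch.cat([ctx, query], dim=-1))
```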
Citations: 0
Towards Weakly Supervised Text-to-Audio Grounding
IF 8.4 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Vol. 26, pp. 11126-11138 | Pub Date: 2024-08-23 | DOI: 10.1109/TMM.2024.3443614
Xuenan Xu;Ziyang Ma;Mengyue Wu;Kai Yu
Text-to-audio grounding (TAG) task aims to predict the onsets and offsets of sound events described by natural language. This task can facilitate applications such as multimodal information retrieval. This paper focuses on weakly-supervised text-to-audio grounding (WSTAG), where frame-level annotations of sound events are unavailable, and only the caption of a whole audio clip can be utilized for training. WSTAG is superior to strongly-supervised approaches in its scalability to large audio-text datasets. Two WSTAG frameworks are studied in this paper: sentence-level and phrase-level. First, we analyze the limitations of mean pooling used in the previous WSTAG approach and investigate the effects of different pooling strategies. We then propose phrase-level WSTAG to use matching labels between audio clips and phrases for training. Advanced negative sampling strategies and self-supervision are proposed to enhance the accuracy of the weak labels and provide pseudo strong labels. Experimental results show that our system significantly outperforms previous WSTAG methods. Finally, we conduct extensive experiments to analyze the effects of several factors on phrase-level WSTAG.
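The pooling question at the heart of WSTAG can be made concrete: a clip-level weak label must be aggregated from frame-level probabilities, and the choice of aggregation decides what the weak supervision can express. A small sketch of three standard strategies (mean pooling blurs short events, max ignores duration, linear-softmax weights frames by their own probability); the function name is ours:

```python
import torch

def pool_frame_probs(p, strategy="linsoft"):
    """Aggregate frame-level event probabilities p of shape (B, T) into a
    clip-level score in [0, 1]."""
    if strategy == "mean":
        return p.mean(dim=1)
    if strategy == "max":
        return p.max(dim=1).values
    if strategy == "linsoft":   # linear softmax: sum(p^2) / sum(p)
        return (p * p).sum(dim=1) / p.sum(dim=1).clamp(min=1e-8)
    raise ValueError(f"unknown strategy: {strategy}")
```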
Citations: 0
CarveNet: Carving Point-Block for Complex 3D Shape Completion
IF 7.3 | CAS Tier 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-08-16 | DOI: 10.1109/tmm.2024.3443613
Qing Guo, Zhijie Wang, Lubo Wang, Haotian Dong, Felix Juefei-Xu, Di Lin, Lei Ma, Wei Feng, Yang Liu
Citations: 0