
ACM Transactions on Multimedia Computing Communications and Applications: Latest Publications

Bridging the Domain Gap in Scene Flow Estimation via Hierarchical Smoothness Refinement
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-27 | DOI: 10.1145/3661823
Dejun Zhang, Mian Zhang, Xuefeng Tan, Jun Liu

This paper introduces SmoothFlowNet3D, an innovative encoder-decoder architecture specifically designed for bridging the domain gap in scene flow estimation. To achieve this goal, SmoothFlowNet3D divides the scene flow estimation task into two stages: initial scene flow estimation and smoothness refinement. Specifically, SmoothFlowNet3D comprises a hierarchical encoder that extracts multi-scale point cloud features from two consecutive frames, along with a hierarchical decoder responsible for predicting the initial scene flow and further refining it to achieve smoother estimation. To generate the initial scene flow, a cross-frame nearest neighbor search operation is performed between the features extracted from two consecutive frames, resulting in forward and backward flow embeddings. These embeddings are then combined to form the bidirectional flow embedding, serving as input for predicting the initial scene flow. Additionally, a flow smoothing module based on the self-attention mechanism is proposed to predict the smoothing error and facilitate the refinement of the initial scene flow for more accurate and smoother estimation results. Extensive experiments demonstrate that the proposed SmoothFlowNet3D approach achieves state-of-the-art performance on both synthetic datasets and real LiDAR point clouds, confirming its effectiveness in enhancing scene flow smoothness.
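As a rough illustration of the cross-frame nearest-neighbor search described in the abstract, the sketch below matches each point feature of one frame to its closest feature in the other frame and concatenates the pair (plus the coordinate offset) into a flow embedding. Function names, feature shapes, and the simple concatenation of the forward and backward embeddings are assumptions for illustration, not the authors' implementation.

```python
import torch

def flow_embedding(feat_a, feat_b, xyz_a, xyz_b):
    """Cross-frame nearest-neighbor search in feature space.

    feat_a, feat_b: (N, C) per-point features of two consecutive frames.
    xyz_a,  xyz_b : (N, 3) point coordinates.
    Returns an (N, 2*C + 3) embedding: each point's own feature, the feature
    of its nearest neighbor in the other frame, and the coordinate offset.
    """
    dist = torch.cdist(feat_a, feat_b)          # (N, N) pairwise feature distances
    idx = dist.argmin(dim=1)                    # nearest neighbor in the other frame
    matched_feat = feat_b[idx]                  # (N, C)
    offset = xyz_b[idx] - xyz_a                 # coarse displacement cue
    return torch.cat([feat_a, matched_feat, offset], dim=1)

# Toy usage: forward and backward searches combined into a bidirectional
# embedding. In the paper the backward embedding would be gathered back to
# frame-1 points before fusion; plain concatenation here is a placeholder.
N, C = 1024, 64
f1, f2 = torch.randn(N, C), torch.randn(N, C)
p1, p2 = torch.randn(N, 3), torch.randn(N, 3)
forward_emb = flow_embedding(f1, f2, p1, p2)
backward_emb = flow_embedding(f2, f1, p2, p1)
bidirectional = torch.cat([forward_emb, backward_emb], dim=1)  # input to the flow head
```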

Citations: 0
EdiTor: Edge-guided Transformer for Ghost-free High Dynamic Range Imaging
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-27 | DOI: 10.1145/3657293
Yuanshen Guan, Ruikang Xu, Mingde Yao, Jie Huang, Zhiwei Xiong

Synthesizing a high dynamic range (HDR) image from multi-exposure images has recently been studied extensively with convolutional neural networks (CNNs). Despite the remarkable progress, existing CNN-based methods have the intrinsic limitation of a local receptive field, which hinders the model's capability of capturing long-range correspondence and large motions across under-/over-exposed images, resulting in ghosting artifacts in dynamic scenes. To address this challenge, we propose a novel Edge-guided Transformer framework (EdiTor) customized for ghost-free HDR reconstruction, where the long-range motions across different exposures can be delicately modeled by incorporating the edge prior. Specifically, EdiTor calculates patch-wise correlation maps on both the image and edge domains, enabling the network to effectively model the global movements and the fine-grained shifts across multiple exposures. Based on this framework, we further propose an exposure-masked loss to adaptively compensate for severely distorted regions (e.g., highlights and shadows). Experiments demonstrate that EdiTor outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors.
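The patch-wise correlation idea can be sketched as follows: unfold both inputs into non-overlapping patches, normalize, and take inner products between every patch pair, on both the raw image and an edge map. The Sobel edge extractor and all names here are assumptions standing in for whatever edge prior EdiTor actually uses.

```python
import torch
import torch.nn.functional as F

def patch_correlation(x, y, patch=8):
    """Patch-wise correlation between two feature maps x, y of shape (B, C, H, W).

    Each patch x patch cell of x is correlated with every cell of y, giving a
    (B, P, P) map (P = number of patches) that captures long-range
    correspondence across exposures.
    """
    bx = F.unfold(x, kernel_size=patch, stride=patch)   # (B, C*patch*patch, P)
    by = F.unfold(y, kernel_size=patch, stride=patch)
    bx = F.normalize(bx, dim=1)
    by = F.normalize(by, dim=1)
    return torch.einsum('bcp,bcq->bpq', bx, by)          # cosine similarity per patch pair

def sobel_edges(img):
    """Simple edge map used as the 'edge domain' input; an assumed stand-in
    for the paper's edge prior."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gray = img.mean(dim=1, keepdim=True)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

# Correlation in both domains, as the abstract describes.
a, b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
corr_image = patch_correlation(a, b)
corr_edge = patch_correlation(sobel_edges(a), sobel_edges(b))
```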

Citations: 0
Learned Video Compression with Adaptive Temporal Prior and Decoded Motion-aided Quality Enhancement
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-27 | DOI: 10.1145/3661824
Jiayu Yang, Chunhui Yang, Fei Xiong, Yongqi Zhai, Ronggang Wang

Learned video compression has drawn great attention and shown promising compression performance recently. In this paper, we focus on two components of the learned video compression framework, i.e., the conditional entropy model and the quality enhancement module, to improve compression performance. Specifically, we propose an adaptive spatial-temporal entropy model for image, motion, and residual compression, which introduces a temporal prior to reduce the temporal redundancy of latents and an additional modulated mask to evaluate the similarity and perform refinement. Besides, a quality enhancement module is proposed for the predicted frame and the reconstructed frame to improve frame quality and reduce the bitrate cost of residual coding. The module reuses the decoded optical flow as a motion prior and utilizes deformable convolution to mine high-quality information from the reference frame in a bit-free manner. The two proposed coding tools are integrated into a pixel-domain residual-coding based compression framework to evaluate their effectiveness. Experimental results demonstrate that our framework achieves competitive compression performance in the low-delay scenario, compared with recent learning-based methods and traditional H.265/HEVC in terms of PSNR and MS-SSIM. The code is available at OpenLVC.
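A minimal sketch of a conditional entropy model with a temporal prior and a modulated mask, in the spirit of the description above: a mask predicted from the hyperprior and the temporal context gates how much of the temporal prior contributes to the Gaussian parameters of the current latent. The module names, layer choices, and shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatialTemporalEntropyModel(nn.Module):
    """Toy conditional entropy model: Gaussian parameters of the current
    latent are predicted from the hyperprior and a temporal prior, with a
    learned mask modulating how much the temporal prior is trusted."""

    def __init__(self, channels=128):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        self.param_net = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)

    def forward(self, hyper_prior, temporal_prior):
        # Mask ~ similarity between the current hyperprior and temporal context.
        mask = self.mask_net(torch.cat([hyper_prior, temporal_prior], dim=1))
        fused = torch.cat([hyper_prior, mask * temporal_prior], dim=1)
        mean, scale = self.param_net(fused).chunk(2, dim=1)
        return mean, torch.nn.functional.softplus(scale)  # scale must be positive

model = SpatialTemporalEntropyModel()
hp, tp = torch.randn(1, 128, 16, 16), torch.randn(1, 128, 16, 16)
mean, scale = model(hp, tp)
```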

Citations: 0
Domain-invariant and Patch-discriminative Feature Learning for General Deepfake Detection
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-27 | DOI: 10.1145/3657297
Jian Zhang, Jiangqun Ni, Fan Nie, Jiwu Huang

Hyper-realistic avatars in the metaverse have already raised security concerns about deepfake techniques: a deepfake built from a generated video "recording" may be mistaken for a real recording of the people it depicts. As a result, deepfake detection has drawn considerable attention in the multimedia forensics community. Though existing methods for deepfake detection achieve fairly good performance under the intra-dataset scenario, many of them yield unsatisfying results in the more practically relevant cross-dataset setting, where the forged faces in the training and testing datasets come from different domains. To tackle this issue, in this paper, we propose a novel Domain-Invariant and Patch-Discriminative feature learning framework, DI&PD. For image-level feature learning, a single-side adversarial domain generalization is introduced to eliminate domain variances and learn domain-invariant features across training samples from different manipulation methods, along with a global and local random-crop augmentation strategy to generate more data views of forged images at various scales. A graph structure is then built by splitting the learned image-level feature maps, with each spatial location corresponding to a local patch, which facilitates patch representation learning by message-passing among similar nodes. Two types of center losses are utilized to learn more discriminative features in both the image-level and patch-level embedding spaces. Extensive experimental results on several datasets demonstrate the effectiveness and generalization of the proposed method compared with other state-of-the-art methods.
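The single-side adversarial domain generalization and the center loss can be illustrated with the standard building blocks below: a gradient reversal layer feeding a domain classifier (so the backbone learns manipulation-method-invariant features) and a simple center loss pulling embeddings toward class centers. This is a generic sketch of those techniques, not the DI&PD training code.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass;
    the standard trick behind adversarial domain-invariant feature learning."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def center_loss(features, labels, centers):
    """Pull each embedding toward the center of its class, encouraging
    discriminative real/fake clusters."""
    return ((features - centers[labels]) ** 2).sum(dim=1).mean()

# Toy usage: the domain classifier sees reversed gradients from the backbone.
feat = torch.randn(8, 256, requires_grad=True)             # backbone output
domain_head = nn.Linear(256, 4)                            # 4 manipulation methods
domain_logits = domain_head(GradReverse.apply(feat, 1.0))
adv_loss = nn.functional.cross_entropy(domain_logits, torch.randint(0, 4, (8,)))

centers = torch.randn(2, 256)                              # real / fake class centers
cls_loss = center_loss(feat, torch.randint(0, 2, (8,)), centers)
(adv_loss + cls_loss).backward()
```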

Citations: 0
Integrated Sensing, Communication, and Computing for Cost-effective Multimodal Federated Perception
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-26 | DOI: 10.1145/3661313
Ning Chen, Zhipeng Cheng, Xuwei Fan, Zhang Liu, Bangzhen Huang, Yifeng Zhao, Lianfen Huang, Xiaojiang Du, Mohsen Guizani

Federated learning (FL) is a prominent paradigm of 6G edge intelligence (EI), which mitigates the privacy breaches and high communication pressure caused by conventional centralized model training in the artificial intelligence of things (AIoT). The execution of multimodal federated perception (MFP) services comprises three sub-processes, including sensing-based multimodal data generation, communication-based model transmission, and computing-based model training, which ultimately compete for the available underlying multi-domain physical resources such as time, frequency, and computing power. How to reasonably coordinate multi-domain resource scheduling among sensing, communication, and computing is therefore vital to MFP networks. To address these issues, this paper explores service-oriented resource management with integrated sensing, communication, and computing (ISCC). Specifically, employing the incentive mechanism of the MFP service market, the resource management problem is defined as a social welfare maximization problem, in which the notions of "expanding resources" and "reducing costs" are used to enhance the learning performance gain and reduce resource costs. Experimental results demonstrate the effectiveness and robustness of the proposed resource scheduling mechanisms.
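As a toy illustration of the social welfare maximization framing, the snippet below greedily allocates a bandwidth budget across MFP clients to maximize a sum of concave utilities minus linear costs. The log utility, the cost model, and the greedy allocation are illustrative assumptions and not the paper's formulation.

```python
# Toy greedy allocation for a social-welfare-style objective:
#   maximize  sum_i  u_i * log(1 + x_i)  -  c_i * x_i     s.t.  sum_i x_i <= B
# where x_i is the bandwidth given to client i. The utility and cost terms
# are invented for illustration only.
import math

def allocate(utility, cost, budget, step=1.0):
    x = [0.0] * len(utility)
    for _ in range(int(budget / step)):
        # Marginal welfare of giving one more unit to each client.
        gains = [u * (math.log(1 + xi + step) - math.log(1 + xi)) - c * step
                 for u, c, xi in zip(utility, cost, x)]
        best = max(range(len(gains)), key=gains.__getitem__)
        if gains[best] <= 0:          # no client benefits any more
            break
        x[best] += step
    return x

print(allocate(utility=[5.0, 3.0, 1.0], cost=[0.1, 0.2, 0.5], budget=20))
```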

Citations: 0
High Efficiency Deep-learning Based Video Compression
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-23 | DOI: 10.1145/3661311
Lv Tang, Xinfeng Zhang

Although deep learning techniques have achieved significant improvements in image compression, their advantages have not been fully explored in video compression, leaving the performance of deep-learning-based video compression (DLVC) clearly inferior to that of the hybrid video coding framework. In this paper, we propose a novel network that improves the performance of DLVC in its most important modules, including Motion Process (MP), Residual Compression (RC), and Frame Reconstruction (FR). In MP, we design a split second-order attention and multi-scale feature extraction module to fully remove the warping artifacts from multi-scale feature space and pixel space, which helps reduce the distortion in the following process. In RC, we propose a channel selection mechanism to gradually drop redundant information while preserving informative channels for a better rate-distortion performance. Finally, in FR, we introduce a residual multi-scale recurrent network to improve the quality of the current reconstructed frame by progressively exploiting temporal context information between it and its several previously reconstructed frames. Extensive experiments are conducted on three widely used video compression datasets (HEVC, UVG and MCL-JVC), and the performance demonstrates the superiority of our proposed approach over state-of-the-art methods.
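The channel selection mechanism can be sketched as a learned gate that scores channels and keeps only the top-k informative ones; the hard top-k gate and module names below are assumptions, and a real codec would likely need a differentiable relaxation during training.

```python
import torch
import torch.nn as nn

class ChannelSelect(nn.Module):
    """Keep the k highest-scoring channels of a feature map and zero the rest,
    dropping redundant information while preserving informative channels."""

    def __init__(self, channels, keep_ratio=0.5):
        super().__init__()
        self.score = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                   nn.Conv2d(channels, channels, 1),
                                   nn.Sigmoid())
        self.k = max(1, int(channels * keep_ratio))

    def forward(self, x):
        s = self.score(x)                              # (B, C, 1, 1) channel importance
        topk = s.flatten(1).topk(self.k, dim=1).indices
        mask = torch.zeros_like(s.flatten(1)).scatter_(1, topk, 1.0)
        return x * mask.view_as(s)                     # zero out redundant channels

y = ChannelSelect(64)(torch.randn(2, 64, 32, 32))
```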

Citations: 0
Recurrent Appearance Flow for Occlusion-Free Virtual Try-On
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-23 | DOI: 10.1145/3659581
Xiaoling Gu, Junkai Zhu, Yongkang Wong, Zizhao Wu, Jun Yu, Jianping Fan, Mohan S. Kankanhalli

Image-based virtual try-on, which aims to transfer a target in-shop garment onto a reference person, has recently garnered significant attention from the research community. However, previous methods have faced severe challenges in handling occlusion problems. To address this limitation, we classify occlusion problems into three types based on the reference person's arm postures: single-arm occlusion, two-arm non-crossed occlusion, and two-arm crossed occlusion. Specifically, we propose a novel Occlusion-Free Virtual Try-On Network (OF-VTON) that effectively overcomes these occlusion challenges. The OF-VTON framework consists of two core components: i) a new Recurrent Appearance Flow based Deformation (RAFD) model that robustly aligns the in-shop garment to the reference person by adopting a multi-task learning strategy; this model jointly produces a dense appearance flow to warp the garment and predicts a human segmentation map to provide semantic guidance for the subsequent image synthesis model; and ii) a powerful Multi-mask Image SynthesiS (MISS) model that generates photo-realistic try-on results by introducing a new mask generation and selection mechanism. Experimental results demonstrate that our proposed OF-VTON significantly outperforms existing state-of-the-art methods by mitigating the impact of occlusion problems. Our code is available at https://github.com/gxl-groups/OF-VTON.
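Appearance-flow warping of the in-shop garment can be illustrated with a dense flow field and F.grid_sample; in a recurrent scheme such as RAFD, a residual flow would be predicted and the warp repeated at each step. The shapes and the identity-flow usage example are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def warp_by_appearance_flow(garment, flow):
    """Warp a garment image with a dense appearance flow.

    garment: (B, 3, H, W) in-shop garment image.
    flow   : (B, 2, H, W) per-pixel offsets in normalized [-1, 1] coordinates.
    """
    B, _, H, W = garment.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W),
                            indexing='ij')
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)          # (B, H, W, 2) sampling grid
    return F.grid_sample(garment, grid, align_corners=True)

# A recurrent model would predict a residual flow at each step and re-warp;
# here a zero flow simply reproduces the input (identity warp).
garment = torch.randn(1, 3, 128, 96)
flow = torch.zeros(1, 2, 128, 96)
warped = warp_by_appearance_flow(garment, flow)
```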

Citations: 0
Multi-grained Representation Aggregating Transformer with Gating Cycle for Change Captioning
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-22 | DOI: 10.1145/3660346
Shengbin Yue, Yunbin Tu, Liang Li, Shengxiang Gao, Zhengtao Yu

Change captioning aims to describe the difference within an image pair in natural language, combining visual comprehension and language generation. Although significant progress has been achieved, perceiving the object change from different perspectives remains a key challenge, especially in the severe situation of drastic viewpoint change. In this paper, we propose a novel fully attentive network, namely the Multi-grained Representation Aggregating Transformer (MURAT), to distinguish the actual change from viewpoint change. Specifically, the Pair Encoder first captures similar semantics between pairwise objects in a multi-level manner, which are regarded as the semantic cues for distinguishing the irrelevant change. Next, a novel Multi-grained Representation Aggregator (MRA) is designed to construct a reliable difference representation by employing both coarse- and fine-grained semantic cues. Finally, the language decoder generates a description of the change based on the output of the MRA. Besides, a Gating Cycle Mechanism is introduced to enforce semantic consistency between difference representation learning and language generation through a reverse manipulation, so as to bridge the semantic gap between change features and text features. Extensive experiments demonstrate that the proposed MURAT greatly improves the ability to describe the actual change under the distraction of irrelevant change and achieves state-of-the-art performance on three benchmarks, CLEVR-Change, CLEVR-DC and Spot-the-Diff.
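One plausible reading of the Pair Encoder idea is cross-attention between the two images' token sets to capture shared semantics, with the residual serving as the change representation passed to the decoder; the sketch below follows that reading with assumed shapes and module names, and is not the MURAT architecture itself.

```python
import torch
import torch.nn as nn

class PairDifference(nn.Module):
    """Attend 'before' tokens over 'after' tokens (and vice versa) to capture
    shared semantics, then take the residual as a change representation."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, before, after):
        # What in 'after' looks like each 'before' token (shared content), and vice versa.
        shared_b, _ = self.cross(before, after, after)
        shared_a, _ = self.cross(after, before, before)
        diff = torch.cat([before - shared_b, after - shared_a], dim=1)
        return diff                                   # fed to the caption decoder

tokens_before = torch.randn(2, 196, 256)              # e.g. 14x14 patch tokens
tokens_after = torch.randn(2, 196, 256)
change_repr = PairDifference()(tokens_before, tokens_after)
```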

Citations: 0
Seventeen Years of the ACM Transactions on Multimedia Computing, Communications and Applications: A Bibliometric Overview
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-18 | DOI: 10.1145/3660347
Walayat Hussain, Honghao Gao, Rafiul Karim, Abdulmotaleb El Saddik

ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) has been dedicated to advancing multimedia research, fostering discoveries, innovations, and practical applications since 2005. The journal consistently publishes top-notch, original research in emerging fields through open submissions, calls for papers, special issues, rigorous review processes, and diverse research topics. This study aims to delve into an extensive bibliometric analysis of the journal, utilising various bibliometric indicators. The paper seeks to unveil the latent implications within the journal’s scholarly landscape from 2005 to 2022. The data primarily draws from the Web of Science (WoS) Core Collection database. The analysis encompasses diverse viewpoints, including yearly publication rates and citations, identifying highly cited papers, and assessing the most prolific authors, institutions, and countries. The paper employs VOSviewer-generated graphical maps, effectively illustrating networks of co-citations, keyword co-occurrences, and institutional and national bibliographic couplings. Furthermore, the study conducts a comprehensive global and temporal examination of co-occurrences of the author’s keywords. This investigation reveals the emergence of numerous novel keywords over the past decades.
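As a small illustration of one analysis mentioned above, author-keyword co-occurrence can be counted by tallying keyword pairs that appear in the same paper; the sample records below are invented, whereas the study itself draws real keyword lists from the WoS Core Collection.

```python
from collections import Counter
from itertools import combinations

# Invented sample records; the study uses real WoS keyword lists.
papers = [
    ["video streaming", "QoE", "ABR"],
    ["deep learning", "video compression", "QoE"],
    ["deep learning", "deepfake detection"],
]

cooccurrence = Counter()
for keywords in papers:
    for a, b in combinations(sorted(set(keywords)), 2):
        cooccurrence[(a, b)] += 1

for pair, count in cooccurrence.most_common(5):
    print(pair, count)
```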

Citations: 0
C2: ABR Streaming in Cognizant of Consumption Context for Improved QoE and Resource Usage Tradeoffs
IF 5.1 | CAS Tier 3 (Computer Science) | Q1 Computer Science | Pub Date: 2024-04-18 | DOI: 10.1145/3652517
Cheonjin Park, Chinmaey Shende, Subhabrata Sen, Bing Wang

Smartphones have emerged as ubiquitous platforms for people to consume content in a wide range of consumption contexts (C2), e.g., over cellular or WiFi, playing back audio and video directly on the phone or through peripheral devices such as external screens or speakers. In this paper, we argue that a user's specific C2 is an important factor to consider in Adaptive Bitrate (ABR) streaming. We examine the current practices of using C2 in five popular ABR players, and identify various limitations in existing treatments that have a detrimental impact on network resource usage and user experience. We then formulate C2-cognizant ABR streaming as an optimization problem and develop practical best-practice guidelines to realize it. Instantiating these guidelines, we develop a proof-of-concept implementation in the widely used state-of-the-art ExoPlayer platform and demonstrate that it leads to significantly better tradeoffs in terms of user experience and resource usage. Lastly, we show that the guidelines also benefit the dash.js player, which uses an ABR logic significantly different from that of ExoPlayer.
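A toy sketch of the C2 idea: cap the bitrate ladder by what the current consumption context can actually benefit from (e.g., audio-only playback or a small phone screen versus an external display) before applying a throughput rule. The ladder, caps, and selection rule below are invented for illustration and are not the paper's guidelines or the ExoPlayer implementation.

```python
# Toy context-aware bitrate pick: choose the highest ladder rung that the
# estimated throughput supports, but never above what the consumption
# context (C2) can perceptually benefit from. All values are illustrative.
BITRATE_LADDER_KBPS = [235, 750, 1750, 3000, 6000]

C2_CAP_KBPS = {
    "audio_only": 235,         # screen off / background audio
    "phone_screen": 1750,      # small display, higher rungs wasted
    "external_display": 6000,  # full quality justified
}

def pick_bitrate(throughput_kbps, context):
    cap = C2_CAP_KBPS[context]
    candidates = [b for b in BITRATE_LADDER_KBPS
                  if b <= min(0.8 * throughput_kbps, cap)]
    return candidates[-1] if candidates else BITRATE_LADDER_KBPS[0]

print(pick_bitrate(5000, "phone_screen"))      # -> 1750, saves data at equal QoE
print(pick_bitrate(5000, "external_display"))  # -> 3000, limited by throughput
```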

Citations: 0