Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee
Texture plays an important role in cartoon illustrations, displaying object materials and enriching the visual experience. Unfortunately, manually designing and drawing an appropriate texture is not easy even for proficient artists, let alone novices or amateurs. While countless textures exist on the Internet, picking an appropriate one with traditional text-based search engines is difficult. Though several texture pickers have been proposed, they still require users to browse the textures themselves, which remains labor-intensive and time-consuming. In this paper, an automatic texture recommendation system is proposed for recommending multiple textures to replace a set of user-specified regions in a cartoon illustration with a visually pleasant look. Two measurements, the suitability measurement and the style-consistency measurement, are proposed to ensure that the recommended textures are suitable for cartoon illustration and, at the same time, mutually consistent in style. Suitability is measured based on the synthesizability, cartoonity, and region fitness of textures. Style-consistency is predicted using a learning-based solution, since judging whether two textures are consistent in style is subjective. An optimization problem is formulated and solved via a genetic algorithm. Our method is validated on various cartoon illustrations, and convincing results are obtained.
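As an illustration of the optimization step described above, the following is a minimal sketch that assigns one candidate texture to each region by maximizing a weighted sum of per-region suitability and pairwise style-consistency scores with a simple genetic algorithm; the score matrices, weights, and GA hyperparameters are hypothetical placeholders, not the authors' implementation.

```python
# Sketch: genetic-algorithm search over texture assignments (one texture per region).
import random
import numpy as np

def fitness(assign, suit, consist, alpha=1.0, beta=1.0):
    # assign[i] = index of the texture chosen for region i
    s = sum(suit[i, t] for i, t in enumerate(assign))
    c = sum(consist[assign[i], assign[j]]
            for i in range(len(assign)) for j in range(i + 1, len(assign)))
    return alpha * s + beta * c

def recommend(suit, consist, pop_size=50, generations=200, mut_rate=0.1):
    n_regions, n_textures = suit.shape
    pop = [np.random.randint(n_textures, size=n_regions) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: fitness(a, suit, consist), reverse=True)
        elite = pop[: pop_size // 2]                      # keep the better half
        children = []
        while len(children) < pop_size - len(elite):
            p1, p2 = random.sample(elite, 2)
            cut = random.randrange(1, n_regions)          # one-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            for i in range(n_regions):                    # random mutation
                if random.random() < mut_rate:
                    child[i] = random.randrange(n_textures)
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda a: fitness(a, suit, consist))

# Toy usage with random scores: 4 regions, 20 candidate textures.
suit = np.random.rand(4, 20)
consist = np.random.rand(20, 20); consist = (consist + consist.T) / 2
print(recommend(suit, consist))
```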
{"title":"Suitable and Style-consistent Multi-texture Recommendation for Cartoon Illustrations","authors":"Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee","doi":"10.1145/3652518","DOIUrl":"https://doi.org/10.1145/3652518","url":null,"abstract":"<p>Texture plays an important role in cartoon illustrations to display object materials and enrich visual experiences. Unfortunately, manually designing and drawing an appropriate texture is not easy even for proficient artists, let alone novice or amateur people. While there exist tons of textures on the Internet, it is not easy to pick an appropriate one using traditional text-based search engines. Though several texture pickers have been proposed, they still require the users to browse the textures by themselves, which is still labor-intensive and time-consuming. In this paper, an automatic texture recommendation system is proposed for recommending multiple textures to replace a set of user-specified regions in a cartoon illustration with visually pleasant look. Two measurements, the suitability measurement and the style-consistency measurement, are proposed to make sure that the recommended textures are suitable for cartoon illustration and at the same time mutually consistent in style. The suitability is measured based on the synthesizability, cartoonity, and region fitness of textures. The style-consistency is predicted using a learning-based solution since it is subjective to judge whether two textures are consistent in style. An optimization problem is formulated and solved via the genetic algorithm. Our method is validated on various cartoon illustrations, and convincing results are obtained.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"37 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140106551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang
Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite their significant potential for practical application, both tasks are still nascent. This is because transcribing lyrics and note events solely from singing audio is notoriously difficult due to noise contamination, e.g., musical accompaniment, which degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Through single-modal experiments, we also explore the individual contribution of each modality to the multimodal system. Finally, we combine the two modalities and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.
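To make the fusion step concrete, here is a minimal sketch of a residual cross-attention block in which acoustic features attend to visual (lip-movement) features and the attended result is added back to the acoustic stream; the dimensions and layer choices are assumptions, not the paper's exact module.

```python
# Sketch: residual cross-attention between audio (queries) and video (keys/values).
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, video_feat):
        # audio_feat: (B, T_a, dim) queries; video_feat: (B, T_v, dim) keys/values
        attended, _ = self.attn(query=audio_feat, key=video_feat, value=video_feat)
        return self.norm(audio_feat + attended)   # residual keeps the audio path intact

# Toy usage: fuse 100 audio frames with 25 video (lip) frames.
fusion = ResidualCrossAttention()
audio = torch.randn(2, 100, 512)
video = torch.randn(2, 25, 512)
print(fusion(audio, video).shape)   # torch.Size([2, 100, 512])
```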
{"title":"Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing","authors":"Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang","doi":"10.1145/3651310","DOIUrl":"https://doi.org/10.1145/3651310","url":null,"abstract":"<p>Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is notoriously difficult due to the presence of noise contamination, e.g., musical accompaniment, resulting in a degradation of both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic encoders and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Subsequently, through single-modal experiments, we also explore the individual contributions of each modality to the multimodal system. Finally, we combine these and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"45 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140129838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luca Guarnera, Oliver Giudice, Sebastiano Battiato
Detecting and recognizing deepfakes is a pressing issue in the digital age. In this study, we first collected a dataset of pristine images and fake images generated by nine different Generative Adversarial Network (GAN) architectures and four Diffusion Models (DMs). The dataset contained a total of 83,000 images, with an equal distribution between real and deepfake data. Then, to address different deepfake detection and recognition tasks, we proposed a hierarchical multi-level approach. At the first level, we separated real images from AI-generated ones. At the second level, we distinguished between images generated by GANs and DMs. At the third level (composed of two additional sub-levels), we recognized the specific GAN and DM architectures used to generate the synthetic data. Experimental results demonstrated that our approach achieved more than 97% classification accuracy, outperforming existing state-of-the-art methods. The models obtained at the different levels are robust to various attacks such as JPEG compression (with different quality-factor values) and resizing, demonstrating that the framework can be applied in real-world contexts (such as the analysis of multimedia data shared on social platforms) and can even support forensic investigations aimed at countering the illicit use of these powerful, modern generative models. We are able to identify the specific GAN or DM architecture used to generate an image, which is critical in tracking down the source of a deepfake. Our hierarchical multi-level approach to deepfake detection and recognition shows promising results, improving on standard flat multiclass detection systems by about 2% on average and allowing each level to focus on its underlying task. The proposed method has the potential to enhance the performance of deepfake detection systems, aid in the fight against the spread of fake images, and safeguard the authenticity of digital media.
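The hierarchical decision flow described above can be sketched as a cascade of three classifiers; the `predict` interface and the classifier objects below are stand-ins, not the paper's trained models.

```python
# Sketch: cascaded multi-level deepfake recognition (real -> GAN/DM -> architecture).
def classify_image(image, level1, level2, level3_gan, level3_dm):
    """Each classifier is assumed to expose a predict(image) -> label method."""
    if level1.predict(image) == "real":
        return {"verdict": "real"}
    family = level2.predict(image)                 # "GAN" or "DM"
    if family == "GAN":
        arch = level3_gan.predict(image)           # e.g. one of the nine GAN architectures
    else:
        arch = level3_dm.predict(image)            # e.g. one of the four diffusion models
    return {"verdict": "deepfake", "family": family, "architecture": arch}
```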
{"title":"Mastering Deepfake Detection: A Cutting-Edge Approach to Distinguish GAN and Diffusion-Model Images","authors":"Luca Guarnera, Oliver Giudice, Sebastiano Battiato","doi":"10.1145/3652027","DOIUrl":"https://doi.org/10.1145/3652027","url":null,"abstract":"<p>Detecting and recognizing deepfakes is a pressing issue in the digital age. In this study, we first collected a dataset of pristine images and fake ones properly generated by nine different Generative Adversarial Network (GAN) architectures and four Diffusion Models (DM). The dataset contained a total of 83,000 images, with equal distribution between the real and deepfake data. Then, to address different deepfake detection and recognition tasks, we proposed a hierarchical multi-level approach. At the first level, we classified real images from AI-generated ones. At the second level, we distinguished between images generated by GANs and DMs. At the third level (composed of two additional sub-levels), we recognized the specific GAN and DM architectures used to generate the synthetic data. Experimental results demonstrated that our approach achieved more than 97% classification accuracy, outperforming existing state-of-the-art methods. The models obtained in the different levels turn out to be robust to various attacks such as JPEG compression (with different quality factor values) and resize (and others), demonstrating that the framework can be used and applied in real-world contexts (such as the analysis of multimedia data shared in the various social platforms) for support even in forensic investigations in order to counter the illicit use of these powerful and modern generative models. We are able to identify the specific GAN and DM architecture used to generate the image, which is critical in tracking down the source of the deepfake. Our hierarchical multi-level approach to deepfake detection and recognition shows promising results in identifying deepfakes allowing focus on underlying task by improving (about (2% ) on the average) standard multiclass flat detection systems. The proposed method has the potential to enhance the performance of deepfake detection systems, aid in the fight against the spread of fake images, and safeguard the authenticity of digital media.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"6 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
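As one concrete reading of the sentence embedding consistency constraint, the sketch below pools frame-level outputs of the visual and sequential modules into sentence embeddings and pulls them together with a cosine-based loss; the pooling, loss form, and the weighting name are assumptions, not the paper's exact formulation.

```python
# Sketch: sentence embedding consistency between visual and sequential module outputs.
import torch
import torch.nn.functional as F

def sentence_consistency_loss(visual_feats, sequential_feats):
    # visual_feats, sequential_feats: (B, T, C) frame-level outputs of the two modules
    v = F.normalize(visual_feats.mean(dim=1), dim=-1)        # (B, C) sentence embedding
    s = F.normalize(sequential_feats.mean(dim=1), dim=-1)
    return (1.0 - F.cosine_similarity(v, s, dim=-1)).mean()  # 0 when embeddings align

# Used alongside the CTC objective with a (hypothetical) weighting factor, e.g.
# loss = ctc_loss + lambda_sec * sentence_consistency_loss(v_out, s_out)
```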
{"title":"Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal","authors":"Ronglai Zuo, Brian Mak","doi":"10.1145/3640815","DOIUrl":"https://doi.org/10.1145/3640815","url":null,"abstract":"<p>Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"53 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo
Visual tracking is a fundamental task in computer vision with significant practical applications in various domains, including surveillance, security, robotics, and human-computer interaction. However, it faces limitations with visible-light data, such as low-light environments, occlusion, and camouflage, which can significantly reduce its accuracy. To cope with these challenges, researchers have explored combining the visible and infrared modalities to improve tracking performance. By leveraging the complementary strengths of visible and infrared data, RGB-infrared fusion tracking has emerged as a promising approach to address these limitations and improve tracking accuracy in challenging scenarios. In this paper, we present a review of RGB-infrared fusion tracking. Specifically, we categorize existing RGBT tracking methods into four categories based on their underlying architectures, feature representations, and fusion strategies, namely feature-decoupling-based methods, feature-selection-based methods, collaborative graph tracking methods, and traditional fusion methods. Furthermore, we provide a critical analysis of their strengths, limitations, representative methods, and future research directions. To further demonstrate the advantages and disadvantages of these methods, we review publicly available RGBT tracking datasets and analyze the main results reported on them. Moreover, we discuss current limitations of RGBT tracking and outline opportunities and future directions for RGBT visual tracking, such as dataset diversity and unsupervised and weakly supervised applications. In conclusion, our survey aims to serve as a useful resource for researchers and practitioners interested in the emerging field of RGBT tracking and to promote further progress and innovation in this area.
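For readers unfamiliar with the fusion strategies surveyed here, the sketch below shows one simple adaptive-weight feature fusion of RGB and thermal (TIR) backbone features; the gating design is an illustrative assumption rather than any specific surveyed method.

```python
# Sketch: adaptive weighting of RGB and thermal features before a tracking head.
import torch
import torch.nn as nn

class AdaptiveRGBTFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Predict per-modality weights from the concatenated global descriptors.
        self.gate = nn.Sequential(nn.Linear(2 * channels, 2), nn.Softmax(dim=-1))

    def forward(self, rgb_feat, tir_feat):
        # rgb_feat, tir_feat: (B, C, H, W) backbone features of the two modalities
        desc = torch.cat([rgb_feat.mean(dim=(2, 3)), tir_feat.mean(dim=(2, 3))], dim=1)
        w = self.gate(desc)                               # (B, 2) modality weights
        return w[:, 0, None, None, None] * rgb_feat + w[:, 1, None, None, None] * tir_feat

fused = AdaptiveRGBTFusion()(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 31, 31))
print(fused.shape)   # torch.Size([1, 256, 31, 31])
```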
{"title":"Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective","authors":"ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo","doi":"10.1145/3651308","DOIUrl":"https://doi.org/10.1145/3651308","url":null,"abstract":"<p>Visual tracking is a fundamental task in computer vision with significant practical applications in various domains, including surveillance, security, robotics, and human-computer interaction. However, it may face limitations in visible light data, such as low-light environments, occlusion, and camouflage, which can significantly reduce its accuracy. To cope with these challenges, researchers have explored the potential of combining the visible and infrared modalities to improve tracking performance. By leveraging the complementary strengths of visible and infrared data, RGB-infrared fusion tracking has emerged as a promising approach to address these limitations and improve tracking accuracy in challenging scenarios. In this paper, we present a review on RGB-infrared fusion tracking. Specifically, we categorize existing RGBT tracking methods into four categories based on their underlying architectures, feature representations, and fusion strategies, namely feature decoupling based method, feature selecting based method, collaborative graph tracking method, and traditional fusion method. Furthermore, we provide a critical analysis of their strengths, limitations, representative methods, and future research directions. To further demonstrate the advantages and disadvantages of these methods, we present a review of publicly available RGBT tracking datasets and analyze the main results on public datasets. Moreover,we discuss some limitations in RGBT tracking at present and provide some opportunities and future directions for RGBT visual tracking, such as dataset diversity, unsupervised and weakly supervised applications. In conclusion, our survey aims to serve as a useful resource for researchers and practitioners interested in the emerging field of RGBT tracking, and to promote further progress and innovation in this area.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"2 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video models trained with federated learning (FL) enable continual learning for video tasks on end-user devices while protecting the privacy of end-user data. As a result, security issues in FL, e.g., backdoor attacks on FL and their defenses, have increasingly become the subject of extensive research in recent years. Backdoor attacks on FL are a class of poisoning attacks in which an attacker, as one of the training participants, submits poisoned parameters and thus injects a backdoor into the global model after aggregation. Existing FL-based backdoor attacks against video models only poison RGB frames, so the attack can be easily mitigated by two-stream model neutralization. It is therefore a major challenge to manipulate the most advanced two-stream video models with a high success rate by poisoning only a small proportion of the training data in the FL framework. In this paper, a new backdoor attack scheme incorporating the rich spatial and temporal structure of video data is proposed, which injects backdoor triggers into both the optical flow and the RGB frames of video data through multiple rounds of model aggregation. In addition, an adversarial attack is applied to the RGB frames to further boost the robustness of the attacks. Extensive experiments on real-world datasets verify that our methods outperform state-of-the-art backdoor attacks and show better performance in terms of stealthiness and persistence.
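For context on the aggregation step such a poisoning participant exploits, the sketch below shows plain FedAvg-style weighted averaging of client updates; it only illustrates how one client's parameters enter the global model and is not the attack itself.

```python
# Sketch: FedAvg-style weighted averaging of client model parameters.
import torch

def federated_average(client_states, client_sizes):
    """client_states: list of model state_dicts; client_sizes: samples per client."""
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        # Each client's parameters contribute in proportion to its data size,
        # which is why a single poisoned update propagates into the global model.
        global_state[name] = sum(
            state[name] * (n / total) for state, n in zip(client_states, client_sizes)
        )
    return global_state
```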
{"title":"Backdoor Two-Stream Video Models on Federated Learning","authors":"Jing Zhao, Hongwei Yang, Hui He, Jie Peng, Weizhe Zhang, Jiangqun Ni, Arun Kumar Sangaiah, Aniello Castiglione","doi":"10.1145/3651307","DOIUrl":"https://doi.org/10.1145/3651307","url":null,"abstract":"<p>Video models on federated learning (FL) enable continual learning of the involved models for video tasks on end-user devices while protecting the privacy of end-user data. As a result, the security issues on FL, e.g., the backdoor attacks on FL and their defense have increasingly becoming the domains of extensive research in recent years. The backdoor attacks on FL are a class of poisoning attacks, in which an attacker, as one of the training participants, submits poisoned parameters and thus injects the backdoor into the global model after aggregation. Existing backdoor attacks against videos based on FL only poison RGB frames, which makes that the attack could be easily mitigated by two-stream model neutralization. Therefore, it is a big challenge to manipulate the most advanced two-stream video model with a high success rate by poisoning only a small proportion of training data in the framework of FL. In this paper, a new backdoor attack scheme incorporating the rich spatial and temporal structures of video data is proposed, which injects the backdoor triggers into both the optical flow and RGB frames of video data through multiple rounds of model aggregations. In addition, the adversarial attack is utilized on the RGB frames to further boost the robustness of the attacks. Extensive experiments on real-world datasets verify that our methods outperform the state-of-the-art backdoor attacks and show better performance in terms of stealthiness and persistence.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"122 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César
Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This paper presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1200, and 1500 ms on the execution of a standardized task involving the collaboration of two remote users who meet in a virtual space and construct block-based shapes. To measure the impact of the delay in this communication scenario, objective and subjective data were collected. As objective data, we measured the time required to execute the tasks and computed conversational characteristics by analysing the recorded audio signals. As subjective data, a questionnaire was completed by every user to evaluate factors such as overall quality, perception of delay, annoyance using the system, level of presence, cybersickness, and other subjective factors associated with social interaction. The results show a clear influence of delay on perceived quality and a significant negative effect as the delay increases. Specifically, the results indicate that the acceptable threshold for end-to-end delay should not exceed 900 ms. This article additionally provides guidelines for developing standardized XR tasks for assessing interaction in Social XR environments.
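As an example of the kind of conversational characteristics that can be derived from recorded audio, the sketch below performs crude energy-based voice-activity detection and measures silence-gap lengths; the thresholds and frame sizes are illustrative assumptions, not the study's analysis pipeline.

```python
# Sketch: silence-gap (pause) lengths from an audio track via energy-based VAD.
import numpy as np

def speech_gaps(audio, sr, frame_ms=30, energy_thresh=1e-4):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    active = frames.astype(np.float64).var(axis=1) > energy_thresh   # crude per-frame VAD
    gaps, run = [], 0
    for is_speech in active:
        if is_speech:
            if run:
                gaps.append(run * frame_ms / 1000.0)   # silence duration in seconds
            run = 0
        else:
            run += 1
    return gaps   # e.g. np.mean(gaps) as an average pause / turn-gap length

# Toy usage: 10 s of synthetic low-level noise at 16 kHz.
print(np.mean(speech_gaps(np.random.randn(160000) * 0.01, 16000) or [0.0]))
```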
{"title":"Delay threshold for social interaction in volumetric eXtended Reality communication","authors":"Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César","doi":"10.1145/3651164","DOIUrl":"https://doi.org/10.1145/3651164","url":null,"abstract":"<p>Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This paper presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1200, and 1500 ms on the execution of a standardized task involving the collaboration of two remote users that meet in a virtual space and construct block-based shapes. To measure the impact of the delay in this communication scenario, objective and subjective data were collected. As objective data, we measured the time required to execute the tasks and computed conversational characteristics by analysing the recorded audio signals. As subjective data, a questionnaire was prepared and completed by every user to evaluate different factors such as overall quality, perception of delay, annoyance using the system, level of presence, cybersickness, and other subjective factors associated with social interaction. The results show a clear influence of the delay on the perceived quality and a significant negative effect as the delay increases. Specifically, the results indicate that the acceptable threshold for end-to-end delay should not exceed 900 ms. This article, additionally provides guidelines for developing standardized XR tasks for assessing interaction in Social XR environments.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"32 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video super-resolution (VSR) algorithms aim at recovering a temporally consistent high-resolution (HR) video from its corresponding low-resolution (LR) video sequence. Due to the limited bandwidth during video transmission, most videos available on the Internet are compressed. Nevertheless, few existing algorithms consider the compression factor in practical applications. In this paper, we propose an enhanced VSR model for compressed videos, termed ECVSR, to simultaneously achieve compression-artifact reduction and SR reconstruction end-to-end. ECVSR contains a motion-excited temporal adaption network (METAN) and a multi-frame SR network (SRNet). The METAN takes decoded LR video frames as input and models inter-frame correlations via bidirectional deformable alignment and motion-excited temporal adaption, where temporal differences are calculated as a motion prior to excite the motion-sensitive regions of the temporal features. In SRNet, cascaded recurrent multi-scale blocks (RMSB) are employed to learn deep spatio-temporal representations from the adapted multi-frame features. We then build a reconstruction module for spatio-temporal information integration and HR frame reconstruction, followed by a detail refinement module for texture and visual quality enhancement. Extensive experimental results on compressed videos demonstrate the superiority of our method for compressed VSR. Code will be available at https://github.com/lifengcs/ECVSR.
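To illustrate the motion-excitation idea, the sketch below uses frame-difference features as a motion prior that gates motion-sensitive regions of the temporal features; it is an assumption-laden toy module, not the paper's METAN.

```python
# Sketch: temporal differences as a motion prior gating per-frame features.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame features of the aligned sequence
        motion = feats[:, 1:] - feats[:, :-1]                      # temporal differences
        motion = torch.cat([motion, motion[:, -1:]], dim=1)        # pad to keep length T
        b, t, c, h, w = motion.shape
        gate = torch.sigmoid(self.conv(motion.reshape(b * t, c, h, w)))
        return feats + feats * gate.reshape(b, t, c, h, w)         # excite motion regions

out = MotionExcitation()(torch.randn(1, 5, 64, 32, 32))
print(out.shape)   # torch.Size([1, 5, 64, 32, 32])
```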
{"title":"Enhanced Video Super-Resolution Network Towards Compressed Data","authors":"Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, Yao Zhao","doi":"10.1145/3651309","DOIUrl":"https://doi.org/10.1145/3651309","url":null,"abstract":"<p>Video super-resolution (VSR) algorithms aim at recovering a temporally consistent high-resolution (HR) video from its corresponding low-resolution (LR) video sequence. Due to the limited bandwidth during video transmission, most available videos on the internet are compressed. Nevertheless, few existing algorithms consider the compression factor in practical applications. In this paper, we propose an enhanced VSR model towards compressed videos, termed as ECVSR, to simultaneously achieve compression artifacts reduction and SR reconstruction end-to-end. ECVSR contains a motion-excited temporal adaption network (METAN) and a multi-frame SR network (SRNet). The METAN takes decoded LR video frames as input and models inter-frame correlations via bidirectional deformable alignment and motion-excited temporal adaption, where temporal differences are calculated as motion prior to excite the motion-sensitive regions of temporal features. In SRNet, cascaded recurrent multi-scale blocks (RMSB) are employed to learn deep spatio-temporal representations from adapted multi-frame features. Then, we build a reconstruction module for spatio-temporal information integration and HR frame reconstruction, which is followed by a detail refinement module for texture and visual quality enhancement. Extensive experimental results on compressed videos demonstrate the superiority of our method for compressed VSR. Code will be available at https://github.com/lifengcs/ECVSR.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"75 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article presents the results of an empirical study that investigated the influence of various types of audio (spatial and non-spatial) on user quality of experience (QoE) and visual attention in 360° videos. The study compared the head pose, eye gaze, pupil dilation, heart rate, and subjective responses of 73 users who watched ten 360° videos with different sound configurations. The configurations evaluated were no sound; non-spatial (stereo) audio; and two spatial sound conditions (first- and third-order ambisonics). The videos covered various categories and presented both indoor and outdoor scenarios. The subjective responses were analyzed using an ANOVA (Analysis of Variance) to assess mean differences between sound conditions. Data visualization was also employed to enhance the interpretability of the results. The findings reveal diverse viewing patterns, physiological responses, and subjective experiences among users watching 360° videos under different sound conditions. Spatial audio, in particular third-order ambisonics, garnered heightened attention, evident in increased pupil dilation and heart rate. Furthermore, the presence of spatial audio led to more diverse head poses when sound sources were distributed across the scene. These findings have important implications for developing effective techniques for optimizing the processing, encoding, distribution, and rendering of VR and 360° video content with spatialized audio. The insights are also relevant to the creative realms of content design and enhancement, providing valuable guidance on how spatial audio influences user attention, physiological responses, and overall subjective experience. Understanding these dynamics can assist content creators and designers in crafting immersive experiences that leverage spatialized audio to captivate users, enhance engagement, and optimize the overall quality of virtual reality and 360° video content. The dataset, the scripts used for data collection, the ffmpeg commands used for processing the videos, and the subjective questionnaire with its statistical analysis are publicly available.
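The ANOVA step mentioned above can be reproduced in a few lines; the rating arrays below are placeholders for illustration only, not the study's data.

```python
# Sketch: one-way ANOVA on a subjective rating across the four sound conditions.
import numpy as np
from scipy.stats import f_oneway

no_sound = np.array([3.1, 3.4, 2.9, 3.0, 3.2])
stereo   = np.array([3.8, 3.6, 3.9, 3.7, 3.5])
foa      = np.array([4.0, 4.2, 3.9, 4.1, 4.0])   # first-order ambisonics
hoa      = np.array([4.3, 4.1, 4.4, 4.2, 4.5])   # third-order ambisonics

f_stat, p_value = f_oneway(no_sound, stereo, foa, hoa)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")     # p < 0.05 -> condition means differ
```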
{"title":"A Quality of Experience and Visual Attention Evaluation for 360° videos with non-spatial and spatial audio","authors":"Amit Hirway, Yuansong Qiao, Niall Murray","doi":"10.1145/3650208","DOIUrl":"https://doi.org/10.1145/3650208","url":null,"abstract":"<p>This article presents the results of an empirical study that aimed to investigate the influence of various types of audio (spatial and non-spatial) on the user quality of experience (QoE) of and visual attention in 360° videos. The study compared the head pose, eye gaze, pupil dilations, heart rate and subjective responses of 73 users who watched ten 360° videos with different sound configurations. The configurations evaluated were no sound; non-spatial (stereo) audio; and two spatial sound conditions (first and third-order ambisonics). The videos covered various categories and presented both indoor and outdoor scenarios. The subjective responses were analyzed using an ANOVA (Analysis of Variance) to assess mean differences between sound conditions. Data visualization was also employed to enhance the interpretability of the results. The findings reveal diverse viewing patterns, physiological responses, and subjective experiences among users watching 360° videos with different sound conditions. Spatial audio, in particular third-order ambisonics, garnered heightened attention. This is evident in increased pupil dilation and heart rate. Furthermore, the presence of spatial audio led to more diverse head poses when sound sources were distributed across the scene. These findings have important implications for the development of effective techniques for optimizing processing, encoding, distributing, and rendering content in VR and 360° videos with spatialized audio. These insights are also relevant in the creative realms of content design and enhancement. They provide valuable guidance on how spatial audio influences user attention, physiological responses, and overall subjective experiences. Understanding these dynamics can assist content creators and designers in crafting immersive experiences that leverage spatialized audio to captivate users, enhance engagement, and optimize the overall quality of virtual reality and 360° video content. The dataset, scripts used for data collection, ffmpeg commands used for processing the videos and the subjective questionnaire and its statistical analysis are publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"43 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow
Adaptive bitrate (ABR) algorithms play a critical role in video streaming by making optimal bitrate decisions under dynamically changing network conditions to provide a high quality of experience (QoE) for users. However, most existing ABRs suffer from limitations such as predefined rules and incorrect assumptions about streaming parameters. They often prioritize higher bitrates and ignore the corresponding energy footprint, resulting in increased energy consumption, especially for mobile device users. Additionally, most ABR algorithms do not consider perceived quality, leading to suboptimal user experience. This paper proposes a novel ABR scheme called GreenABR+, which utilizes deep reinforcement learning to optimize energy consumption during video streaming while maintaining high user QoE. Unlike existing rule-based ABR algorithms, GreenABR+ makes no assumptions about video settings or the streaming environment. The GreenABR+ model works on different video representation sets and can adapt to dynamically changing conditions in a wide range of network scenarios. Our experiments demonstrate that GreenABR+ outperforms state-of-the-art ABR algorithms, saving up to 57% in streaming energy consumption and 57% in data consumption while providing up to 25% higher perceptual QoE thanks to up to 87% less rebuffering time and near-zero capacity violations. Its generalization and dynamic adaptability make GreenABR+ a flexible solution for energy-efficient ABR optimization.
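As a sketch of the kind of energy-aware reward such an RL-based ABR agent could optimize, the function below combines perceptual quality, rebuffering, quality oscillation, and energy; the terms and weights are illustrative assumptions, not GreenABR+'s actual reward definition.

```python
# Sketch: per-chunk reward trading off perceptual QoE terms against energy cost.
def abr_reward(perceptual_quality, rebuffer_s, quality_change, energy_j,
               w_q=1.0, w_r=4.0, w_s=1.0, w_e=0.5):
    """perceptual_quality: e.g. a VMAF-like score for the downloaded chunk;
    rebuffer_s: stall time caused by the chunk; quality_change: |q_t - q_{t-1}|;
    energy_j: estimated energy to download and decode the chunk (joules)."""
    return (w_q * perceptual_quality
            - w_r * rebuffer_s          # penalize stalls most heavily
            - w_s * quality_change      # penalize oscillating quality
            - w_e * energy_j)           # penalize the energy footprint

# Example: a high-quality chunk with no stall but a large energy cost.
print(abr_reward(perceptual_quality=85.0, rebuffer_s=0.0, quality_change=5.0, energy_j=20.0))
```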
{"title":"GreenABR+: Generalized Energy-Aware Adaptive Bitrate Streaming","authors":"Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow","doi":"10.1145/3649898","DOIUrl":"https://doi.org/10.1145/3649898","url":null,"abstract":"<p>Adaptive bitrate (ABR) algorithms play a critical role in video streaming by making optimal bitrate decisions in dynamically changing network conditions to provide a high quality of experience (QoE) for users. However, most existing ABRs suffer from limitations such as predefined rules and incorrect assumptions about streaming parameters. They often prioritize higher bitrates and ignore the corresponding energy footprint, resulting in increased energy consumption, especially for mobile device users. Additionally, most ABR algorithms do not consider perceived quality, leading to suboptimal user experience. This paper proposes a novel ABR scheme called GreenABR+, which utilizes deep reinforcement learning to optimize energy consumption during video streaming while maintaining high user QoE. Unlike existing rule-based ABR algorithms, GreenABR+ makes no assumptions about video settings or the streaming environment. GreenABR+ model works on different video representation sets and can adapt to dynamically changing conditions in a wide range of network scenarios. Our experiments demonstrate that GreenABR+ outperforms state-of-the-art ABR algorithms by saving up to 57% in streaming energy consumption and 57% in data consumption while providing up to 25% more perceptual QoE due to up to 87% less rebuffering time and near-zero capacity violations. The generalization and dynamic adaptability make GreenABR+ a flexible solution for energy-efficient ABR optimization.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"237 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140036369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}