Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee
Texture plays an important role in cartoon illustrations, displaying object materials and enriching the visual experience. Unfortunately, manually designing and drawing an appropriate texture is not easy even for proficient artists, let alone novices or amateurs. While countless textures exist on the Internet, picking an appropriate one with traditional text-based search engines is difficult. Though several texture pickers have been proposed, they still require users to browse the textures themselves, which remains labor-intensive and time-consuming. In this paper, an automatic texture recommendation system is proposed for recommending multiple textures to replace a set of user-specified regions in a cartoon illustration with a visually pleasant look. Two measurements, the suitability measurement and the style-consistency measurement, are proposed to ensure that the recommended textures are suitable for cartoon illustration and, at the same time, mutually consistent in style. Suitability is measured based on the synthesizability, cartoonity, and region fitness of textures. Style-consistency is predicted using a learning-based solution, since judging whether two textures are consistent in style is subjective. An optimization problem is formulated and solved via a genetic algorithm. Our method is validated on various cartoon illustrations, and convincing results are obtained.
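As an illustration of the optimization step described above, the following is a minimal sketch that assigns one candidate texture to each region by maximizing a weighted sum of per-region suitability and pairwise style-consistency scores with a simple genetic algorithm; the score matrices, weights, and GA hyperparameters are hypothetical placeholders, not the authors' implementation.

```python
# Sketch: genetic-algorithm search over texture assignments (one texture per region).
import random
import numpy as np

def fitness(assign, suit, consist, alpha=1.0, beta=1.0):
    # assign[i] = index of the texture chosen for region i
    s = sum(suit[i, t] for i, t in enumerate(assign))
    c = sum(consist[assign[i], assign[j]]
            for i in range(len(assign)) for j in range(i + 1, len(assign)))
    return alpha * s + beta * c

def recommend(suit, consist, pop_size=50, generations=200, mut_rate=0.1):
    n_regions, n_textures = suit.shape
    pop = [np.random.randint(n_textures, size=n_regions) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda a: fitness(a, suit, consist), reverse=True)
        elite = pop[: pop_size // 2]                      # keep the better half
        children = []
        while len(children) < pop_size - len(elite):
            p1, p2 = random.sample(elite, 2)
            cut = random.randrange(1, n_regions)          # one-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            for i in range(n_regions):                    # random mutation
                if random.random() < mut_rate:
                    child[i] = random.randrange(n_textures)
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda a: fitness(a, suit, consist))

# Toy usage with random scores: 4 regions, 20 candidate textures.
suit = np.random.rand(4, 20)
consist = np.random.rand(20, 20); consist = (consist + consist.T) / 2
print(recommend(suit, consist))
```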
{"title":"Suitable and Style-consistent Multi-texture Recommendation for Cartoon Illustrations","authors":"Huisi Wu, Zhaoze Wang, Yifan Li, Xueting Liu, Tong-Yee Lee","doi":"10.1145/3652518","DOIUrl":"https://doi.org/10.1145/3652518","url":null,"abstract":"<p>Texture plays an important role in cartoon illustrations to display object materials and enrich visual experiences. Unfortunately, manually designing and drawing an appropriate texture is not easy even for proficient artists, let alone novice or amateur people. While there exist tons of textures on the Internet, it is not easy to pick an appropriate one using traditional text-based search engines. Though several texture pickers have been proposed, they still require the users to browse the textures by themselves, which is still labor-intensive and time-consuming. In this paper, an automatic texture recommendation system is proposed for recommending multiple textures to replace a set of user-specified regions in a cartoon illustration with visually pleasant look. Two measurements, the suitability measurement and the style-consistency measurement, are proposed to make sure that the recommended textures are suitable for cartoon illustration and at the same time mutually consistent in style. The suitability is measured based on the synthesizability, cartoonity, and region fitness of textures. The style-consistency is predicted using a learning-based solution since it is subjective to judge whether two textures are consistent in style. An optimization problem is formulated and solved via the genetic algorithm. Our method is validated on various cartoon illustrations, and convincing results are obtained.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"37 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140106551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang
Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics, while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite their significant potential for practical application, both tasks are still nascent. This is because transcribing lyrics and note events solely from singing audio is notoriously difficult due to noise contamination, e.g., musical accompaniment, which degrades both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Through single-modal experiments, we also explore the individual contribution of each modality to the multimodal system. Finally, we combine the two modalities and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.
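To make the fusion step concrete, here is a minimal sketch of a residual cross-attention block in which acoustic features attend to visual (lip-movement) features and the attended result is added back to the acoustic stream; the dimensions and layer choices are assumptions, not the paper's exact module.

```python
# Sketch: residual cross-attention between audio (queries) and video (keys/values).
import torch
import torch.nn as nn

class ResidualCrossAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, video_feat):
        # audio_feat: (B, T_a, dim) queries; video_feat: (B, T_v, dim) keys/values
        attended, _ = self.attn(query=audio_feat, key=video_feat, value=video_feat)
        return self.norm(audio_feat + attended)   # residual keeps the audio path intact

# Toy usage: fuse 100 audio frames with 25 video (lip) frames.
fusion = ResidualCrossAttention()
audio = torch.randn(2, 100, 512)
video = torch.randn(2, 25, 512)
print(fusion(audio, video).shape)   # torch.Size([2, 100, 512])
```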
{"title":"Automatic Lyric Transcription and Automatic Music Transcription from Multimodal Singing","authors":"Xiangming Gu, Longshen Ou, Wei Zeng, Jianan Zhang, Nicholas Wong, Ye Wang","doi":"10.1145/3651310","DOIUrl":"https://doi.org/10.1145/3651310","url":null,"abstract":"<p>Automatic lyric transcription (ALT) refers to transcribing singing voices into lyrics while automatic music transcription (AMT) refers to transcribing singing voices into note events, i.e., musical MIDI notes. Despite these two tasks having significant potential for practical application, they are still nascent. This is because the transcription of lyrics and note events solely from singing audio is notoriously difficult due to the presence of noise contamination, e.g., musical accompaniment, resulting in a degradation of both the intelligibility of sung lyrics and the recognizability of sung notes. To address this challenge, we propose a general framework for implementing multimodal ALT and AMT systems. Additionally, we curate the first multimodal singing dataset, comprising N20EMv1 and N20EMv2, which encompasses audio recordings and videos of lip movements, together with ground truth for lyrics and note events. For model construction, we propose adapting self-supervised learning models from the speech domain as acoustic encoders and visual encoders to alleviate the scarcity of labeled data. We also introduce a residual cross-attention mechanism to effectively integrate features from the audio and video modalities. Through extensive experiments, we demonstrate that our single-modal systems exhibit state-of-the-art performance on both ALT and AMT tasks. Subsequently, through single-modal experiments, we also explore the individual contributions of each modality to the multimodal system. Finally, we combine these and demonstrate the effectiveness of our proposed multimodal systems, particularly in terms of their noise robustness.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"45 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140129838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luca Guarnera, Oliver Giudice, Sebastiano Battiato
Detecting and recognizing deepfakes is a pressing issue in the digital age. In this study, we first collected a dataset of pristine images and fake images generated by nine different Generative Adversarial Network (GAN) architectures and four Diffusion Models (DMs). The dataset contained a total of 83,000 images, with an equal distribution between real and deepfake data. Then, to address different deepfake detection and recognition tasks, we proposed a hierarchical multi-level approach. At the first level, we separated real images from AI-generated ones. At the second level, we distinguished between images generated by GANs and DMs. At the third level (composed of two additional sub-levels), we recognized the specific GAN and DM architectures used to generate the synthetic data. Experimental results demonstrated that our approach achieved more than 97% classification accuracy, outperforming existing state-of-the-art methods. The models obtained at the different levels are robust to various attacks such as JPEG compression (with different quality-factor values) and resizing, demonstrating that the framework can be applied in real-world contexts (such as the analysis of multimedia data shared on social platforms) and can even support forensic investigations aimed at countering the illicit use of these powerful, modern generative models. We are able to identify the specific GAN or DM architecture used to generate an image, which is critical in tracking down the source of a deepfake. Our hierarchical multi-level approach to deepfake detection and recognition shows promising results, improving on standard flat multiclass detection systems by about 2% on average and allowing each level to focus on its underlying task. The proposed method has the potential to enhance the performance of deepfake detection systems, aid in the fight against the spread of fake images, and safeguard the authenticity of digital media.
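The hierarchical decision flow described above can be sketched as a cascade of three classifiers; the `predict` interface and the classifier objects below are stand-ins, not the paper's trained models.

```python
# Sketch: cascaded multi-level deepfake recognition (real -> GAN/DM -> architecture).
def classify_image(image, level1, level2, level3_gan, level3_dm):
    """Each classifier is assumed to expose a predict(image) -> label method."""
    if level1.predict(image) == "real":
        return {"verdict": "real"}
    family = level2.predict(image)                 # "GAN" or "DM"
    if family == "GAN":
        arch = level3_gan.predict(image)           # e.g. one of the nine GAN architectures
    else:
        arch = level3_dm.predict(image)            # e.g. one of the four diffusion models
    return {"verdict": "deepfake", "family": family, "architecture": arch}
```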
{"title":"Mastering Deepfake Detection: A Cutting-Edge Approach to Distinguish GAN and Diffusion-Model Images","authors":"Luca Guarnera, Oliver Giudice, Sebastiano Battiato","doi":"10.1145/3652027","DOIUrl":"https://doi.org/10.1145/3652027","url":null,"abstract":"<p>Detecting and recognizing deepfakes is a pressing issue in the digital age. In this study, we first collected a dataset of pristine images and fake ones properly generated by nine different Generative Adversarial Network (GAN) architectures and four Diffusion Models (DM). The dataset contained a total of 83,000 images, with equal distribution between the real and deepfake data. Then, to address different deepfake detection and recognition tasks, we proposed a hierarchical multi-level approach. At the first level, we classified real images from AI-generated ones. At the second level, we distinguished between images generated by GANs and DMs. At the third level (composed of two additional sub-levels), we recognized the specific GAN and DM architectures used to generate the synthetic data. Experimental results demonstrated that our approach achieved more than 97% classification accuracy, outperforming existing state-of-the-art methods. The models obtained in the different levels turn out to be robust to various attacks such as JPEG compression (with different quality factor values) and resize (and others), demonstrating that the framework can be used and applied in real-world contexts (such as the analysis of multimedia data shared in the various social platforms) for support even in forensic investigations in order to counter the illicit use of these powerful and modern generative models. We are able to identify the specific GAN and DM architecture used to generate the image, which is critical in tracking down the source of the deepfake. Our hierarchical multi-level approach to deepfake detection and recognition shows promising results in identifying deepfakes allowing focus on underlying task by improving (about (2% ) on the average) standard multiclass flat detection systems. The proposed method has the potential to enhance the performance of deepfake detection systems, aid in the fight against the spread of fake images, and safeguard the authenticity of digital media.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"6 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
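As one concrete reading of the sentence embedding consistency constraint, the sketch below pools frame-level outputs of the visual and sequential modules into sentence embeddings and pulls them together with a cosine-based loss; the pooling, loss form, and the weighting name are assumptions, not the paper's exact formulation.

```python
# Sketch: sentence embedding consistency between visual and sequential module outputs.
import torch
import torch.nn.functional as F

def sentence_consistency_loss(visual_feats, sequential_feats):
    # visual_feats, sequential_feats: (B, T, C) frame-level outputs of the two modules
    v = F.normalize(visual_feats.mean(dim=1), dim=-1)        # (B, C) sentence embedding
    s = F.normalize(sequential_feats.mean(dim=1), dim=-1)
    return (1.0 - F.cosine_similarity(v, s, dim=-1)).mean()  # 0 when embeddings align

# Used alongside the CTC objective with a (hypothetical) weighting factor, e.g.
# loss = ctc_loss + lambda_sec * sentence_consistency_loss(v_out, s_out)
```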
{"title":"Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal","authors":"Ronglai Zuo, Brian Mak","doi":"10.1145/3640815","DOIUrl":"https://doi.org/10.1145/3640815","url":null,"abstract":"<p>Deep-learning-based continuous sign language recognition (CSLR) models typically consist of a visual module, a sequential module, and an alignment module. However, the effectiveness of training such CSLR backbones is hindered by limited training samples, rendering the use of a single connectionist temporal classification loss insufficient. To address this limitation, we propose three auxiliary tasks to enhance CSLR backbones. First, we enhance the visual module, which is particularly sensitive to the challenges posed by limited training samples, from the perspective of consistency. Specifically, since sign languages primarily rely on signers’ facial expressions and hand movements to convey information, we develop a keypoint-guided spatial attention module that directs the visual module to focus on informative regions, thereby ensuring spatial attention consistency. Furthermore, recognizing that the output features of both the visual and sequential modules represent the same sentence, we leverage this prior knowledge to better exploit the power of the backbone. We impose a sentence embedding consistency constraint between the visual and sequential modules, enhancing the representation power of both features. The resulting CSLR model, referred to as consistency-enhanced CSLR, demonstrates superior performance on signer-dependent datasets, where all signers appear during both training and testing. To enhance its robustness for the signer-independent setting, we propose a signer removal module based on feature disentanglement, effectively eliminating signer-specific information from the backbone. To validate the effectiveness of the proposed auxiliary tasks, we conduct extensive ablation studies. Notably, utilizing a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, including PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"53 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo
Visual tracking is a fundamental task in computer vision with significant practical applications in various domains, including surveillance, security, robotics, and human-computer interaction. However, it faces limitations with visible-light data, such as low-light environments, occlusion, and camouflage, which can significantly reduce its accuracy. To cope with these challenges, researchers have explored combining the visible and infrared modalities to improve tracking performance. By leveraging the complementary strengths of visible and infrared data, RGB-infrared fusion tracking has emerged as a promising approach to address these limitations and improve tracking accuracy in challenging scenarios. In this paper, we present a review of RGB-infrared fusion tracking. Specifically, we categorize existing RGBT tracking methods into four categories based on their underlying architectures, feature representations, and fusion strategies, namely feature-decoupling-based methods, feature-selection-based methods, collaborative graph tracking methods, and traditional fusion methods. Furthermore, we provide a critical analysis of their strengths, limitations, representative methods, and future research directions. To further demonstrate the advantages and disadvantages of these methods, we review publicly available RGBT tracking datasets and analyze the main results reported on them. Moreover, we discuss current limitations of RGBT tracking and outline opportunities and future directions for RGBT visual tracking, such as dataset diversity and unsupervised and weakly supervised applications. In conclusion, our survey aims to serve as a useful resource for researchers and practitioners interested in the emerging field of RGBT tracking and to promote further progress and innovation in this area.
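For readers unfamiliar with the fusion strategies surveyed here, the sketch below shows one simple adaptive-weight feature fusion of RGB and thermal (TIR) backbone features; the gating design is an illustrative assumption rather than any specific surveyed method.

```python
# Sketch: adaptive weighting of RGB and thermal features before a tracking head.
import torch
import torch.nn as nn

class AdaptiveRGBTFusion(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Predict per-modality weights from the concatenated global descriptors.
        self.gate = nn.Sequential(nn.Linear(2 * channels, 2), nn.Softmax(dim=-1))

    def forward(self, rgb_feat, tir_feat):
        # rgb_feat, tir_feat: (B, C, H, W) backbone features of the two modalities
        desc = torch.cat([rgb_feat.mean(dim=(2, 3)), tir_feat.mean(dim=(2, 3))], dim=1)
        w = self.gate(desc)                               # (B, 2) modality weights
        return w[:, 0, None, None, None] * rgb_feat + w[:, 1, None, None, None] * tir_feat

fused = AdaptiveRGBTFusion()(torch.randn(1, 256, 31, 31), torch.randn(1, 256, 31, 31))
print(fused.shape)   # torch.Size([1, 256, 31, 31])
```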
{"title":"Review and Analysis of RGBT Single Object Tracking Methods: A Fusion Perspective","authors":"ZhiHao Zhang, Jun Wang, Zhuli Zang, Lei Jin, Shengjie Li, Hao Wu, Jian Zhao, Zhang Bo","doi":"10.1145/3651308","DOIUrl":"https://doi.org/10.1145/3651308","url":null,"abstract":"<p>Visual tracking is a fundamental task in computer vision with significant practical applications in various domains, including surveillance, security, robotics, and human-computer interaction. However, it may face limitations in visible light data, such as low-light environments, occlusion, and camouflage, which can significantly reduce its accuracy. To cope with these challenges, researchers have explored the potential of combining the visible and infrared modalities to improve tracking performance. By leveraging the complementary strengths of visible and infrared data, RGB-infrared fusion tracking has emerged as a promising approach to address these limitations and improve tracking accuracy in challenging scenarios. In this paper, we present a review on RGB-infrared fusion tracking. Specifically, we categorize existing RGBT tracking methods into four categories based on their underlying architectures, feature representations, and fusion strategies, namely feature decoupling based method, feature selecting based method, collaborative graph tracking method, and traditional fusion method. Furthermore, we provide a critical analysis of their strengths, limitations, representative methods, and future research directions. To further demonstrate the advantages and disadvantages of these methods, we present a review of publicly available RGBT tracking datasets and analyze the main results on public datasets. Moreover,we discuss some limitations in RGBT tracking at present and provide some opportunities and future directions for RGBT visual tracking, such as dataset diversity, unsupervised and weakly supervised applications. In conclusion, our survey aims to serve as a useful resource for researchers and practitioners interested in the emerging field of RGBT tracking, and to promote further progress and innovation in this area.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"2 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video models trained with federated learning (FL) enable continual learning for video tasks on end-user devices while protecting the privacy of end-user data. As a result, security issues in FL, e.g., backdoor attacks on FL and their defenses, have increasingly become the subject of extensive research in recent years. Backdoor attacks on FL are a class of poisoning attacks in which an attacker, as one of the training participants, submits poisoned parameters and thus injects a backdoor into the global model after aggregation. Existing FL-based backdoor attacks against video models only poison RGB frames, so the attack can be easily mitigated by two-stream model neutralization. It is therefore a major challenge to manipulate the most advanced two-stream video models with a high success rate by poisoning only a small proportion of the training data in the FL framework. In this paper, a new backdoor attack scheme incorporating the rich spatial and temporal structure of video data is proposed, which injects backdoor triggers into both the optical flow and the RGB frames of video data through multiple rounds of model aggregation. In addition, an adversarial attack is applied to the RGB frames to further boost the robustness of the attacks. Extensive experiments on real-world datasets verify that our methods outperform state-of-the-art backdoor attacks and show better performance in terms of stealthiness and persistence.
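For context on the aggregation step such a poisoning participant exploits, the sketch below shows plain FedAvg-style weighted averaging of client updates; it only illustrates how one client's parameters enter the global model and is not the attack itself.

```python
# Sketch: FedAvg-style weighted averaging of client model parameters.
import torch

def federated_average(client_states, client_sizes):
    """client_states: list of model state_dicts; client_sizes: samples per client."""
    total = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        # Each client's parameters contribute in proportion to its data size,
        # which is why a single poisoned update propagates into the global model.
        global_state[name] = sum(
            state[name] * (n / total) for state, n in zip(client_states, client_sizes)
        )
    return global_state
```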
{"title":"Backdoor Two-Stream Video Models on Federated Learning","authors":"Jing Zhao, Hongwei Yang, Hui He, Jie Peng, Weizhe Zhang, Jiangqun Ni, Arun Kumar Sangaiah, Aniello Castiglione","doi":"10.1145/3651307","DOIUrl":"https://doi.org/10.1145/3651307","url":null,"abstract":"<p>Video models on federated learning (FL) enable continual learning of the involved models for video tasks on end-user devices while protecting the privacy of end-user data. As a result, the security issues on FL, e.g., the backdoor attacks on FL and their defense have increasingly becoming the domains of extensive research in recent years. The backdoor attacks on FL are a class of poisoning attacks, in which an attacker, as one of the training participants, submits poisoned parameters and thus injects the backdoor into the global model after aggregation. Existing backdoor attacks against videos based on FL only poison RGB frames, which makes that the attack could be easily mitigated by two-stream model neutralization. Therefore, it is a big challenge to manipulate the most advanced two-stream video model with a high success rate by poisoning only a small proportion of training data in the framework of FL. In this paper, a new backdoor attack scheme incorporating the rich spatial and temporal structures of video data is proposed, which injects the backdoor triggers into both the optical flow and RGB frames of video data through multiple rounds of model aggregations. In addition, the adversarial attack is utilized on the RGB frames to further boost the robustness of the attacks. Extensive experiments on real-world datasets verify that our methods outperform the state-of-the-art backdoor attacks and show better performance in terms of stealthiness and persistence.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"122 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140072366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César
Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This paper presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1200, and 1500 ms on the execution of a standardized task involving the collaboration of two remote users who meet in a virtual space and construct block-based shapes. To measure the impact of the delay in this communication scenario, objective and subjective data were collected. As objective data, we measured the time required to execute the tasks and computed conversational characteristics by analysing the recorded audio signals. As subjective data, a questionnaire was completed by every user to evaluate factors such as overall quality, perception of delay, annoyance using the system, level of presence, cybersickness, and other subjective factors associated with social interaction. The results show a clear influence of delay on perceived quality and a significant negative effect as the delay increases. Specifically, the results indicate that the acceptable threshold for end-to-end delay should not exceed 900 ms. This article additionally provides guidelines for developing standardized XR tasks for assessing interaction in Social XR environments.
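As an example of the kind of conversational characteristics that can be derived from recorded audio, the sketch below performs crude energy-based voice-activity detection and measures silence-gap lengths; the thresholds and frame sizes are illustrative assumptions, not the study's analysis pipeline.

```python
# Sketch: silence-gap (pause) lengths from an audio track via energy-based VAD.
import numpy as np

def speech_gaps(audio, sr, frame_ms=30, energy_thresh=1e-4):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    active = frames.astype(np.float64).var(axis=1) > energy_thresh   # crude per-frame VAD
    gaps, run = [], 0
    for is_speech in active:
        if is_speech:
            if run:
                gaps.append(run * frame_ms / 1000.0)   # silence duration in seconds
            run = 0
        else:
            run += 1
    return gaps   # e.g. np.mean(gaps) as an average pause / turn-gap length

# Toy usage: 10 s of synthetic low-level noise at 16 kHz.
print(np.mean(speech_gaps(np.random.randn(160000) * 0.01, 16000) or [0.0]))
```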
{"title":"Delay threshold for social interaction in volumetric eXtended Reality communication","authors":"Carlos Cortés, Irene Viola, Jesús Gutiérrez, Jack Jansen, Shishir Subramanyam, Evangelos Alexiou, Pablo Pérez, Narciso García, Pablo César","doi":"10.1145/3651164","DOIUrl":"https://doi.org/10.1145/3651164","url":null,"abstract":"<p>Immersive technologies like eXtended Reality (XR) are the next step in videoconferencing. In this context, understanding the effect of delay on communication is crucial. This paper presents the first study on the impact of delay on collaborative tasks using a realistic Social XR system. Specifically, we design an experiment and evaluate the impact of end-to-end delays of 300, 600, 900, 1200, and 1500 ms on the execution of a standardized task involving the collaboration of two remote users that meet in a virtual space and construct block-based shapes. To measure the impact of the delay in this communication scenario, objective and subjective data were collected. As objective data, we measured the time required to execute the tasks and computed conversational characteristics by analysing the recorded audio signals. As subjective data, a questionnaire was prepared and completed by every user to evaluate different factors such as overall quality, perception of delay, annoyance using the system, level of presence, cybersickness, and other subjective factors associated with social interaction. The results show a clear influence of the delay on the perceived quality and a significant negative effect as the delay increases. Specifically, the results indicate that the acceptable threshold for end-to-end delay should not exceed 900 ms. This article, additionally provides guidelines for developing standardized XR tasks for assessing interaction in Social XR environments.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"32 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video super-resolution (VSR) algorithms aim at recovering a temporally consistent high-resolution (HR) video from its corresponding low-resolution (LR) video sequence. Due to the limited bandwidth during video transmission, most videos available on the Internet are compressed. Nevertheless, few existing algorithms consider the compression factor in practical applications. In this paper, we propose an enhanced VSR model for compressed videos, termed ECVSR, to simultaneously achieve compression-artifact reduction and SR reconstruction end-to-end. ECVSR contains a motion-excited temporal adaption network (METAN) and a multi-frame SR network (SRNet). The METAN takes decoded LR video frames as input and models inter-frame correlations via bidirectional deformable alignment and motion-excited temporal adaption, where temporal differences are calculated as a motion prior to excite the motion-sensitive regions of the temporal features. In SRNet, cascaded recurrent multi-scale blocks (RMSB) are employed to learn deep spatio-temporal representations from the adapted multi-frame features. We then build a reconstruction module for spatio-temporal information integration and HR frame reconstruction, followed by a detail refinement module for texture and visual quality enhancement. Extensive experimental results on compressed videos demonstrate the superiority of our method for compressed VSR. Code will be available at https://github.com/lifengcs/ECVSR.
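To illustrate the motion-excitation idea, the sketch below uses frame-difference features as a motion prior that gates motion-sensitive regions of the temporal features; it is an assumption-laden toy module, not the paper's METAN.

```python
# Sketch: temporal differences as a motion prior gating per-frame features.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: (B, T, C, H, W) per-frame features of the aligned sequence
        motion = feats[:, 1:] - feats[:, :-1]                      # temporal differences
        motion = torch.cat([motion, motion[:, -1:]], dim=1)        # pad to keep length T
        b, t, c, h, w = motion.shape
        gate = torch.sigmoid(self.conv(motion.reshape(b * t, c, h, w)))
        return feats + feats * gate.reshape(b, t, c, h, w)         # excite motion regions

out = MotionExcitation()(torch.randn(1, 5, 64, 32, 32))
print(out.shape)   # torch.Size([1, 5, 64, 32, 32])
```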
{"title":"Enhanced Video Super-Resolution Network Towards Compressed Data","authors":"Feng Li, Yixuan Wu, Anqi Li, Huihui Bai, Runmin Cong, Yao Zhao","doi":"10.1145/3651309","DOIUrl":"https://doi.org/10.1145/3651309","url":null,"abstract":"<p>Video super-resolution (VSR) algorithms aim at recovering a temporally consistent high-resolution (HR) video from its corresponding low-resolution (LR) video sequence. Due to the limited bandwidth during video transmission, most available videos on the internet are compressed. Nevertheless, few existing algorithms consider the compression factor in practical applications. In this paper, we propose an enhanced VSR model towards compressed videos, termed as ECVSR, to simultaneously achieve compression artifacts reduction and SR reconstruction end-to-end. ECVSR contains a motion-excited temporal adaption network (METAN) and a multi-frame SR network (SRNet). The METAN takes decoded LR video frames as input and models inter-frame correlations via bidirectional deformable alignment and motion-excited temporal adaption, where temporal differences are calculated as motion prior to excite the motion-sensitive regions of temporal features. In SRNet, cascaded recurrent multi-scale blocks (RMSB) are employed to learn deep spatio-temporal representations from adapted multi-frame features. Then, we build a reconstruction module for spatio-temporal information integration and HR frame reconstruction, which is followed by a detail refinement module for texture and visual quality enhancement. Extensive experimental results on compressed videos demonstrate the superiority of our method for compressed VSR. Code will be available at https://github.com/lifengcs/ECVSR.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"75 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article presents the results of an empirical study that investigated the influence of various types of audio (spatial and non-spatial) on user quality of experience (QoE) and visual attention in 360° videos. The study compared the head pose, eye gaze, pupil dilation, heart rate, and subjective responses of 73 users who watched ten 360° videos with different sound configurations. The configurations evaluated were no sound; non-spatial (stereo) audio; and two spatial sound conditions (first- and third-order ambisonics). The videos covered various categories and presented both indoor and outdoor scenarios. The subjective responses were analyzed using an ANOVA (Analysis of Variance) to assess mean differences between sound conditions. Data visualization was also employed to enhance the interpretability of the results. The findings reveal diverse viewing patterns, physiological responses, and subjective experiences among users watching 360° videos under different sound conditions. Spatial audio, in particular third-order ambisonics, garnered heightened attention, evident in increased pupil dilation and heart rate. Furthermore, the presence of spatial audio led to more diverse head poses when sound sources were distributed across the scene. These findings have important implications for developing effective techniques for optimizing the processing, encoding, distribution, and rendering of VR and 360° video content with spatialized audio. The insights are also relevant to the creative realms of content design and enhancement, providing valuable guidance on how spatial audio influences user attention, physiological responses, and overall subjective experience. Understanding these dynamics can assist content creators and designers in crafting immersive experiences that leverage spatialized audio to captivate users, enhance engagement, and optimize the overall quality of virtual reality and 360° video content. The dataset, the scripts used for data collection, the ffmpeg commands used for processing the videos, and the subjective questionnaire with its statistical analysis are publicly available.
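The ANOVA step mentioned above can be reproduced in a few lines; the rating arrays below are placeholders for illustration only, not the study's data.

```python
# Sketch: one-way ANOVA on a subjective rating across the four sound conditions.
import numpy as np
from scipy.stats import f_oneway

no_sound = np.array([3.1, 3.4, 2.9, 3.0, 3.2])
stereo   = np.array([3.8, 3.6, 3.9, 3.7, 3.5])
foa      = np.array([4.0, 4.2, 3.9, 4.1, 4.0])   # first-order ambisonics
hoa      = np.array([4.3, 4.1, 4.4, 4.2, 4.5])   # third-order ambisonics

f_stat, p_value = f_oneway(no_sound, stereo, foa, hoa)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")     # p < 0.05 -> condition means differ
```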
{"title":"A Quality of Experience and Visual Attention Evaluation for 360° videos with non-spatial and spatial audio","authors":"Amit Hirway, Yuansong Qiao, Niall Murray","doi":"10.1145/3650208","DOIUrl":"https://doi.org/10.1145/3650208","url":null,"abstract":"<p>This article presents the results of an empirical study that aimed to investigate the influence of various types of audio (spatial and non-spatial) on the user quality of experience (QoE) of and visual attention in 360° videos. The study compared the head pose, eye gaze, pupil dilations, heart rate and subjective responses of 73 users who watched ten 360° videos with different sound configurations. The configurations evaluated were no sound; non-spatial (stereo) audio; and two spatial sound conditions (first and third-order ambisonics). The videos covered various categories and presented both indoor and outdoor scenarios. The subjective responses were analyzed using an ANOVA (Analysis of Variance) to assess mean differences between sound conditions. Data visualization was also employed to enhance the interpretability of the results. The findings reveal diverse viewing patterns, physiological responses, and subjective experiences among users watching 360° videos with different sound conditions. Spatial audio, in particular third-order ambisonics, garnered heightened attention. This is evident in increased pupil dilation and heart rate. Furthermore, the presence of spatial audio led to more diverse head poses when sound sources were distributed across the scene. These findings have important implications for the development of effective techniques for optimizing processing, encoding, distributing, and rendering content in VR and 360° videos with spatialized audio. These insights are also relevant in the creative realms of content design and enhancement. They provide valuable guidance on how spatial audio influences user attention, physiological responses, and overall subjective experiences. Understanding these dynamics can assist content creators and designers in crafting immersive experiences that leverage spatialized audio to captivate users, enhance engagement, and optimize the overall quality of virtual reality and 360° video content. The dataset, scripts used for data collection, ffmpeg commands used for processing the videos and the subjective questionnaire and its statistical analysis are publicly available.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"43 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140044406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow
Adaptive bitrate (ABR) algorithms play a critical role in video streaming by making optimal bitrate decisions under dynamically changing network conditions to provide a high quality of experience (QoE) for users. However, most existing ABRs suffer from limitations such as predefined rules and incorrect assumptions about streaming parameters. They often prioritize higher bitrates and ignore the corresponding energy footprint, resulting in increased energy consumption, especially for mobile device users. Additionally, most ABR algorithms do not consider perceived quality, leading to suboptimal user experience. This paper proposes a novel ABR scheme called GreenABR+, which utilizes deep reinforcement learning to optimize energy consumption during video streaming while maintaining high user QoE. Unlike existing rule-based ABR algorithms, GreenABR+ makes no assumptions about video settings or the streaming environment. The GreenABR+ model works on different video representation sets and can adapt to dynamically changing conditions in a wide range of network scenarios. Our experiments demonstrate that GreenABR+ outperforms state-of-the-art ABR algorithms, saving up to 57% in streaming energy consumption and 57% in data consumption while providing up to 25% higher perceptual QoE thanks to up to 87% less rebuffering time and near-zero capacity violations. Its generalization and dynamic adaptability make GreenABR+ a flexible solution for energy-efficient ABR optimization.
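As a sketch of the kind of energy-aware reward such an RL-based ABR agent could optimize, the function below combines perceptual quality, rebuffering, quality oscillation, and energy; the terms and weights are illustrative assumptions, not GreenABR+'s actual reward definition.

```python
# Sketch: per-chunk reward trading off perceptual QoE terms against energy cost.
def abr_reward(perceptual_quality, rebuffer_s, quality_change, energy_j,
               w_q=1.0, w_r=4.0, w_s=1.0, w_e=0.5):
    """perceptual_quality: e.g. a VMAF-like score for the downloaded chunk;
    rebuffer_s: stall time caused by the chunk; quality_change: |q_t - q_{t-1}|;
    energy_j: estimated energy to download and decode the chunk (joules)."""
    return (w_q * perceptual_quality
            - w_r * rebuffer_s          # penalize stalls most heavily
            - w_s * quality_change      # penalize oscillating quality
            - w_e * energy_j)           # penalize the energy footprint

# Example: a high-quality chunk with no stall but a large energy cost.
print(abr_reward(perceptual_quality=85.0, rebuffer_s=0.0, quality_change=5.0, energy_j=20.0))
```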
{"title":"GreenABR+: Generalized Energy-Aware Adaptive Bitrate Streaming","authors":"Bekir Oguzhan Turkkan, Ting Dai, Adithya Raman, Tevfik Kosar, Changyou Chen, Muhammed Fatih Bulut, Jaroslaw Zola, Daby Sow","doi":"10.1145/3649898","DOIUrl":"https://doi.org/10.1145/3649898","url":null,"abstract":"<p>Adaptive bitrate (ABR) algorithms play a critical role in video streaming by making optimal bitrate decisions in dynamically changing network conditions to provide a high quality of experience (QoE) for users. However, most existing ABRs suffer from limitations such as predefined rules and incorrect assumptions about streaming parameters. They often prioritize higher bitrates and ignore the corresponding energy footprint, resulting in increased energy consumption, especially for mobile device users. Additionally, most ABR algorithms do not consider perceived quality, leading to suboptimal user experience. This paper proposes a novel ABR scheme called GreenABR+, which utilizes deep reinforcement learning to optimize energy consumption during video streaming while maintaining high user QoE. Unlike existing rule-based ABR algorithms, GreenABR+ makes no assumptions about video settings or the streaming environment. GreenABR+ model works on different video representation sets and can adapt to dynamically changing conditions in a wide range of network scenarios. Our experiments demonstrate that GreenABR+ outperforms state-of-the-art ABR algorithms by saving up to 57% in streaming energy consumption and 57% in data consumption while providing up to 25% more perceptual QoE due to up to 87% less rebuffering time and near-zero capacity violations. The generalization and dynamic adaptability make GreenABR+ a flexible solution for energy-efficient ABR optimization.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"237 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140036369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}