The COVID-19 pandemic has made clear that wearing a face mask limits the spread of respiratory viruses. Face authentication systems, which rely on facial key points such as the eyes, nose, and mouth, struggle to identify a person when most of the face is covered by a mask, and removing the mask for authentication risks spreading infection. The possible solutions are: (a) training face recognition systems to identify a person from upper-face features alone, (b) reconstructing the complete face with a generative model, or (c) training the model on a dataset of masked faces. In this paper, we explore the scope of generative models for image synthesis. We use Stable Diffusion to generate masked face images of popular celebrities from various text prompts. The resulting dataset of 15K realistic masked face images of 100 celebrities is called the Realistic Synthetic Masked Face Dataset (RSMFD). The model and the generated dataset will be made public so that researchers can augment the dataset. To the best of our knowledge, this is the largest masked face recognition dataset with realistic images. The generated images were evaluated on popular deep face recognition models with significant results, and image classification models trained and tested on the dataset achieve competitive performance. The dataset is available at: https://drive.google.com/drive/folders/1yetcgUOL1TOP4rod1geGsOkIrIJHtcEw?usp=sharing
{"title":"Text-Guided Synthesis of Masked Face Images","authors":"Anjali T, Masilamani V","doi":"10.1145/3654667","DOIUrl":"https://doi.org/10.1145/3654667","url":null,"abstract":"<p>The COVID-19 pandemic has made us all understand that wearing a face mask protects us from the spread of respiratory viruses. The face authentication systems, which are trained on the basis of facial key points such as the eyes, nose, and mouth, found it difficult to identify the person when the majority of the face is covered by the face mask. Removing the mask for authentication will cause the infection to spread. The possible solutions are: (a) to train the face recognition systems to identify the person with the upper face features (b) Reconstruct the complete face of the person with a generative model. (c) train the model with a dataset of the masked faces of the people. In this paper, we explore the scope of generative models for image synthesis. We used stable diffusion to generate masked face images of popular celebrities on various text prompts. A realistic dataset of 15K masked face images of 100 celebrities is generated and is called the Realistic Synthetic Masked Face Dataset (RSMFD). The model and the generated dataset will be made public so that researchers can augment the dataset. According to our knowledge, this is the largest masked face recognition dataset with realistic images. The generated images were tested on popular deep face recognition models and achieved significant results. The dataset is also trained and tested on some of the famous image classification models, and the results are competitive. The dataset is available on this link:- https://drive.google.com/drive/folders/1yetcgUOL1TOP4rod1geGsOkIrIJHtcEw?usp=sharing\u0000</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"1 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vehicle re-identification (v-reID) is a crucial and challenging task in intelligent transportation systems (ITS). While vehicle re-identification plays a role in analysing traffic behaviour, criminal investigation, and automatic toll collection, it is also a key component in the construction of smart cities. With the recent introduction of transformer models and their rapid development in computer vision, vehicle re-identification has also made significant progress in performance and development over 2021-2023. This bite-sized review is the first to summarize existing works in vehicle re-identification using pure transformer models and examine their capabilities. We introduce the various applications and challenges, different datasets, evaluation strategies, and loss functions in v-reID. A comparison between existing state-of-the-art methods from different research areas is then provided. Finally, we discuss possible future research directions and provide a checklist on how to implement a v-reID model. This checklist is useful for researchers or practitioners starting their work in this field, and for anyone seeking insight into how to implement an AI model in computer vision using v-reID.
{"title":"Paying Attention to Vehicles: A Systematic Review on Transformer-Based Vehicle Re-Identification","authors":"Yan Qian, Johan Barthélemy, Bo Du, Jun Shen","doi":"10.1145/3655623","DOIUrl":"https://doi.org/10.1145/3655623","url":null,"abstract":"<p>Vehicle re-identification (v-reID) is a crucial and challenging task in the intelligent transportation systems (ITS). While vehicle re-identification plays a role in analysing traffic behaviour, criminal investigation, or automatic toll collection, it is also a key component for the construction of smart cities. With the recent introduction of transformer models and their rapid development in computer vision, vehicle re-identification has also made significant progress in performance and development over 2021-2023. This bite-sized review is the first to summarize existing works in vehicle re-identification using pure transformer models and examine their capabilities. We introduce the various applications and challenges, different datasets, evaluation strategies and loss functions in v-reID. A comparison between existing state-of-the-art methods based on different research areas is then provided. Finally, we discuss possible future research directions and provide a checklist on how to implement a v-reID model. This checklist is useful for an interested researcher or practitioner who is starting their work in this field, and also for anyone who seeks an insight into how to implement an AI model in computer vision using v-reID.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"18 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tingting Han, Quan Zhou, Jun Yu, Zhou Yu, Jianhui Zhang, Sicheng Zhao
Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information that cannot be captured by frame-level video representations. In this paper, we propose the Parameter-free Motion Attention Module (PMAM) to exploit the crucial motion clues potentially contained in adjacent video frames, using a multi-head attention architecture. The PMAM requires no additional trainable parameters, leading to an efficient and effective understanding of video dynamics. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), integrating the parameter-free motion attention module with local and global multi-head attention based on object-centric and scene-centric video representations. The synergistic combination of local motion information, extracted by the proposed PMAM, with long-range interactions modeled by the local and global multi-head attention mechanism significantly enhances the performance of video summarization. Extensive experimental results on the benchmark datasets SumMe and TVSum demonstrate that the proposed MMAN outperforms other state-of-the-art methods, yielding remarkable performance gains.
{"title":"Effective Video Summarization by Extracting Parameter-free Motion Attention","authors":"Tingting Han, Quan Zhou, Jun Yu, Zhou Yu, Jianhui Zhang, Sicheng Zhao","doi":"10.1145/3654670","DOIUrl":"https://doi.org/10.1145/3654670","url":null,"abstract":"<p>Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information which can not be captured by frame-level video representations. In this paper, we propose the Parameter-free Motion Attention Module (PMAM) to exploit the crucial motion clues potentially contained in adjacent video frames, using a multi-head attention architecture. The PMAM requires no additional training for model parameters, leading to an efficient and effective understanding of video dynamics. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), integrating the parameter-free motion attention module with local and global multi-head attention based on object-centric and scene-centric video representations. The synergistic combination of local motion information, extracted by the proposed PMAM, with long-range interactions modeled by the local and global multi-head attention mechanism, can significantly enhance the performance of video summarization. Extensive experimental results on the benchmark datasets, SumMe and TVSum, demonstrate that the proposed MMAN outperforms other state-of-the-art methods, resulting in remarkable performance gains.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"2015 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo
Facial expression generation is one of the most challenging and long-sought goals of character animation, with many interesting applications. The task has traditionally relied heavily on digital craftspersons and remains largely unexplored by automated approaches. In this paper, we introduce a generative framework for producing 3D facial expression sequences (i.e., 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) learning a generative model over a set of 3D landmark sequences, and (2) generating 3D mesh sequences for an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks in other domains. While it can be trained unconditionally, its reverse process can still be conditioned on various signals. This allows us to efficiently develop several downstream tasks involving various forms of conditional generation, using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder that applies the geometric deformation embedded in the landmarks to a given facial mesh. Experiments show that our model learns to generate realistic, high-quality expressions solely from a dataset of relatively small size, improving over state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.
{"title":"4D Facial Expression Diffusion Model","authors":"Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo","doi":"10.1145/3653455","DOIUrl":"https://doi.org/10.1145/3653455","url":null,"abstract":"<p>Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The challenging task, traditionally having relied heavily on digital craftspersons, remains yet to be explored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) Learning the generative model that is trained over a set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks of other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various condition signals. This allows us to efficiently develop several downstream tasks involving various conditional generation, by using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder to apply the geometrical deformation embedded in landmarks on a given facial mesh. Experiments show that our model has learned to generate realistic, quality expressions solely from the dataset of relatively small size, improving over the state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"53 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140315163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-to-image synthesis aims to generate an accurate and semantically consistent image from a given text description. However, it is difficult for existing generative methods to produce semantically complete images from a single piece of text. Some works try to expand the input text into multiple captions by retrieving similar descriptions from the training set, but they still fail to fill in missing image semantics. In this paper, we propose a GAN-based approach to Imagine, Select, and Fuse for text-to-image synthesis, named ISF-GAN. The proposed ISF-GAN contains an Imagine Stage and a Select-and-Fuse Stage to solve the above problems. First, the Imagine Stage introduces a text completion and enrichment module, which guides a GPT-based model to enrich the text expression beyond the original dataset. Second, the Select-and-Fuse Stage selects qualified text descriptions and then introduces a cross-modal attention mechanism that lets these sentences interact with image features at different scales. In short, our proposed model enriches the input text to complete missing semantics and introduces a cross-modal attention mechanism to maximize the use of the enriched text for generating semantically consistent images. Experimental results on the CUB, Oxford-102, and CelebA-HQ datasets demonstrate the effectiveness and superiority of the proposed network. Code is available at https://github.com/Feilingg/ISF-GAN.
{"title":"ISF-GAN: Imagine, Select, and Fuse with GPT-Based Text Enrichment for Text-to-Image Synthesis","authors":"Yefei Sheng, Ming Tao, Jie Wang, Bing-Kun Bao","doi":"10.1145/3650033","DOIUrl":"https://doi.org/10.1145/3650033","url":null,"abstract":"<p>Text-to-Image synthesis aims to generate an accurate and semantically consistent image from a given text description. However, it is difficult for existing generative methods to generate semantically complete images from a single piece of text. Some works try to expand the input text to multiple captions via retrieving similar descriptions of the input text from the training set, but still fail to fill in missing image semantics. In this paper, we propose a GAN-based approach to Imagine, Select, and Fuse for Text-to-Image synthesis, named ISF-GAN. The proposed ISF-GAN contains Imagine Stage and Select and Fuse Stage to solve the above problems. First, the Imagine Stage proposes a text completion and enrichment module. This module guides a GPT-based model to enrich the text expression beyond the original dataset. Second, the Select and Fuse Stage selects qualified text descriptions, and then introduces a cross-modal attentional mechanism to interact these different sentences with the image features at different scales. In short, our proposed model enriches the input text information for completing missing semantics and introduces a cross-modal attentional mechanism to maximize the utilization of enriched text information to generate semantically consistent images. Experimental results on CUB, Oxford-102, and CelebA-HQ datasets prove the effectiveness and superiority of the proposed network. Code is available at https://github.com/Feilingg/ISF-GAN.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"14 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140315401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shizhan Liu, Weiyao Lin, Yihang Chen, Yufeng Zhang, Wenrui Dai, John See, Hongkai Xiong
The rapid advancement of multimedia and imaging technologies has resulted in increasingly diverse visual and semantic data. A wide range of applications, such as remote-assisted driving, requires the combined storage and transmission of various visual and semantic data. However, existing works insufficiently exploit the redundancy between different types of data. In this paper, we propose a unified framework to jointly compress a diverse spectrum of visual and semantic data, including images, point clouds, segmentation maps, object attributes, and relations. We develop a unifying process that embeds the representations of these data into a joint embedding graph according to their categories, which enables flexible handling of joint compression tasks for various visual and semantic data. To fully leverage the redundancy between different data types, we further introduce an embedding-based adaptive joint encoding process and a Semantic Adaptation Module to efficiently encode diverse data based on the learned embeddings in the joint embedding graph. Experiments on the Cityscapes, MSCOCO, and KITTI datasets demonstrate the superiority of our framework, highlighting promising steps toward scalable multimedia processing.
{"title":"A Unified Framework for Jointly Compressing Visual and Semantic Data","authors":"Shizhan Liu, Weiyao Lin, Yihang Chen, Yufeng Zhang, Wenrui Dai, John See, Hongkai Xiong","doi":"10.1145/3654800","DOIUrl":"https://doi.org/10.1145/3654800","url":null,"abstract":"<p>The rapid advancement of multimedia and imaging technologies has resulted in increasingly diverse visual and semantic data. A large range of applications such as remote-assisted driving requires the amalgamated storage and transmission of various visual and semantic data. However, existing works suffer from the limitation of insufficiently exploiting the redundancy between different types of data. In this paper, we propose a unified framework to jointly compress a diverse spectrum of visual and semantic data, including images, point clouds, segmentation maps, object attributes and relations. We develop a unifying process that embeds the representations of these data into a joint embedding graph according to their categories, which enables flexible handling of joint compression tasks for various visual and semantic data. To fully leverage the redundancy between different data types, we further introduce an embedding-based adaptive joint encoding process and a Semantic Adaptation Module to efficiently encode diverse data based on the learned embeddings in the joint embedding graph. Experiments on the Cityscapes, MSCOCO, and KITTI datasets demonstrate the superiority of our framework, highlighting promising steps toward scalable multimedia processing.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"197 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140315162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junjian Huang, Hao Ren, Shulin Liu, Yong Liu, Chuanlu Lv, Jiawen Lu, Changyong Xie, Hong Lu
Images taken under low-light conditions suffer from poor visibility, color distortion, and graininess, all of which degrade image quality and hamper downstream vision tasks such as object detection and instance segmentation in autonomous driving. This makes low-light enhancement an indispensable component of high-level visual tasks. Low-light enhancement aims to mitigate these issues and has garnered extensive attention and research over several decades. The primary challenge in low-light image enhancement arises from the low signal-to-noise ratio (SNR) caused by insufficient lighting. This challenge becomes even more pronounced in near-zero lux conditions, where noise overwhelms the available image information. Both the traditional image signal processing (ISP) pipeline and conventional low-light image enhancement methods struggle in such scenarios. Recently, deep neural networks have been used to address this challenge. These networks take unmodified RAW images as input and produce enhanced sRGB images, forming a deep learning-based ISP pipeline. However, most of these networks are computationally expensive and thus far from practical use. In this paper, we propose a lightweight model called the attentive dilated U-Net (ADU-Net) to tackle this issue. Our model incorporates several innovative designs, including an asymmetric U-shape architecture, dilated residual modules (DRMs) for feature extraction, and attentive fusion modules (AFMs) for feature fusion. The DRMs provide strong representative capability, while the AFMs effectively leverage low-level texture information and high-level semantic information within the network. Both modules employ a lightweight design but offer significant performance gains. Extensive experiments demonstrate that our method is highly effective, achieving an excellent balance between image quality and computational complexity: it takes less than 4 ms for a high-definition 4K image on a single GTX 1080Ti GPU while maintaining competitive visual quality. Furthermore, our method exhibits pleasing scalability and generalizability, highlighting its potential for widespread applicability.
{"title":"Real-time Attentive Dilated U-Net for Extremely Dark Image Enhancement","authors":"Junjian Huang, Hao Ren, Shulin Liu, Yong Liu, Chuanlu Lv, Jiawen Lu, Changyong Xie, Hong Lu","doi":"10.1145/3654668","DOIUrl":"https://doi.org/10.1145/3654668","url":null,"abstract":"<p>Images taken under low-light conditions suffer from poor visibility, color distortion and graininess, all of which degrade the image quality and hamper the performance of downstream vision tasks, such as object detection and instance segmentation in the field of autonomous driving, making low-light enhancement an indispensable basic component of high-level visual tasks. Low-light enhancement aims to mitigate these issues, and has garnered extensive attention and research over several decades. The primary challenge in low-light image enhancement arises from the low signal-to-noise ratio (SNR) caused by insufficient lighting. This challenge becomes even more pronounced in near-zero lux conditions, where noise overwhelms the available image information. Both traditional image signal processing (ISP) pipeline and conventional low-light image enhancement methods struggle in such scenarios. Recently, deep neural networks have been used to address this challenge. These networks take unmodified RAW images as input and produce the enhanced sRGB images, forming a deep learning-based ISP pipeline. However, most of these networks are computationally expensive and thus far from practical use. In this paper, we propose a lightweight model called attentive dilated U-Net (ADU-Net) to tackle this issue. Our model incorporates several innovative designs, including an asymmetric U-shape architecture, dilated residual modules (DRMs) for feature extraction, and attentive fusion modules (AFMs) for feature fusion. The DRMs provide strong representative capability while the AFMs effectively leverage low-level texture information and high-level semantic information within the network. Both modules employ a lightweight design but offer significant performance gains. Extensive experiments demonstrate our method is highly-effective, achieving an excellent balance between image quality and computational complexity, <i>i</i>.<i>e</i>., taking less than 4ms for a high-definition 4K image on a single GTX 1080Ti GPU and yet maintaining competitive visual quality. Furthermore, our method exhibits pleasing scalability and generalizability, highlighting its potential for widespread applicability.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"52 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140297581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baoli Sun, Xinchen Ye, Tiantian Yan, Zhihui Wang, Haojie Li, Zhiyong Wang
Fine-grained video action recognition aims to identify minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, even though it is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. First, we propose a hierarchic correlation reasoning (HCR) module that explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting its correlations with other segments. Second, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video to boost final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with state-of-the-art methods.
{"title":"Discriminative Segment Focus Network for Fine-grained Video Action Recognition","authors":"Baoli Sun, Xinchen Ye, Tiantian Yan, Zhihui Wang, Haojie Li, Zhiyong Wang","doi":"10.1145/3654671","DOIUrl":"https://doi.org/10.1145/3654671","url":null,"abstract":"<p>Fine-grained video action recognition aims to identify minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. Firstly, we propose a hierarchic correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting the correlations with other segments. Secondly, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing the consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video for boosting final recognition. Extensive experimental results on two fine-grained action recognition datasets, <i>i.e.</i>, FineGym and Diving48, and two action recognition datasets, <i>i.e.</i>, Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with the state-of-the-art methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"55 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140297509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Once a video sequence is organized into basic shot units, it is of great interest to temporally link shots into semantically compact scene segments to facilitate long video understanding. However, existing video scene boundary detection methods still struggle to handle the varied visual semantics and complex shot relations found in video scenes. We propose a novel self-supervised learning method, Video Scene Montage for Boundary Detection (VSMBD), to extract rich shot semantics and learn shot relations using unlabeled videos. More specifically, we present Video Scene Montage (VSM) to synthesize reliable pseudo scene boundaries, which learns task-related semantic relations between shots in a self-supervised manner. To lay a solid foundation for modeling semantic relations between shots, we decouple the visual semantics of shots into foreground and background. Instead of costly learning from scratch, as in most previous self-supervised learning methods, we build our model upon large-scale pre-trained visual encoders to extract the foreground and background features. Experimental results demonstrate that VSMBD trains a model with a strong capability for capturing shot relations, surpassing previous methods by significant margins. The code is available at https://github.com/mini-mind/VSMBD.
{"title":"Temporal Scene Montage for Self-Supervised Video Scene Boundary Detection","authors":"Jiawei Tan, Pingan Yang, Lu Chen, Hongxing Wang","doi":"10.1145/3654669","DOIUrl":"https://doi.org/10.1145/3654669","url":null,"abstract":"<p>Once a video sequence is organized as basic shot units, it is of great interest to temporally link shots into semantic-compact scene segments to facilitate long video understanding. However, it still challenges existing video scene boundary detection methods to handle various visual semantics and complex shot relations in video scenes. We proposed a novel self-supervised learning method, Video Scene Montage for Boundary Detection (VSMBD), to extract rich shot semantics and learn shot relations using unlabeled videos. More specifically, we present Video Scene Montage (VSM) to synthesize reliable pseudo scene boundaries, which learns task-related semantic relations between shots in a self-supervised manner. To lay a solid foundation for modeling semantic relations between shots, we decouple visual semantics of shots into foreground and background. Instead of costly learning from scratch as in most previous self-supervised learning methods, we build our model upon large-scale pre-trained visual encoders to extract the foreground and background features. Experimental results demonstrate VSMBD trains a model with strong capability in capturing shot relations, surpassing previous methods by significant margins. The code is available at https://github.com/mini-mind/VSMBD.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"22 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140297698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hao Zhang, Meng Liu, Yuan Qi, Yang Ning, Shunbo Hu, Liqiang Nie, Wenyin Zhang
Accurate and automated segmentation of lesions in brain MRI scans is crucial in diagnostics and treatment planning. Despite the significant achievements of existing approaches, they often require substantial computational resources and fail to fully exploit the synergy between low-level and high-level features. To address these challenges, we introduce the Separable Spatial Convolutional Network (SSCN), an innovative model that refines the U-Net architecture to achieve efficient brain tumor segmentation with minimal computational cost. SSCN integrates the PocketNet paradigm and replaces standard convolutions with depthwise separable convolutions, resulting in a significant reduction in parameters and computational load. Additionally, our feature complementary module enhances the interaction between features across the encoder-decoder structure, facilitating the integration of multi-scale features while maintaining low computational demands. The model also incorporates a separable spatial attention mechanism, enhancing its capability to discern spatial details. Empirical validations on standard datasets demonstrate the effectiveness of our proposed model, especially in segmenting small and medium-sized tumors, with only 0.27M parameters and 3.68 GFLOPs. Our code is available at https://github.com/zzpr/SSCN.
{"title":"Efficient Brain Tumor Segmentation with Lightweight Separable Spatial Convolutional Network","authors":"Hao Zhang, Meng Liu, Yuan Qi, Yang Ning, Shunbo Hu, Liqiang Nie, Wenyin Zhang","doi":"10.1145/3653715","DOIUrl":"https://doi.org/10.1145/3653715","url":null,"abstract":"<p>Accurate and automated segmentation of lesions in brain MRI scans is crucial in diagnostics and treatment planning. Despite the significant achievements of existing approaches, they often require substantial computational resources and fail to fully exploit the synergy between low-level and high-level features. To address these challenges, we introduce the Separable Spatial Convolutional Network (SSCN), an innovative model that refines the U-Net architecture to achieve efficient brain tumor segmentation with minimal computational cost. SSCN integrates the PocketNet paradigm and replaces standard convolutions with depthwise separable convolutions, resulting in a significant reduction in parameters and computational load. Additionally, our feature complementary module enhances the interaction between features across the encoder-decoder structure, facilitating the integration of multi-scale features while maintaining low computational demands. The model also incorporates a separable spatial attention mechanism, enhancing its capability to discern spatial details. Empirical validations on standard datasets demonstrate the effectiveness of our proposed model, especially in segmenting small and medium-sized tumors, with only 0.27M parameters and 3.68GFlops. Our code is available at https://github.com/zzpr/SSCN.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"103 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}