The COVID-19 pandemic has made clear that wearing a face mask limits the spread of respiratory viruses. Face authentication systems, which rely on facial key points such as the eyes, nose, and mouth, struggle to identify a person when most of the face is covered by a mask, and removing the mask for authentication risks spreading infection. The possible solutions are: (a) training face recognition systems to identify a person from upper-face features alone, (b) reconstructing the complete face with a generative model, or (c) training the model on a dataset of masked faces. In this paper, we explore the scope of generative models for image synthesis. We use Stable Diffusion to generate masked face images of popular celebrities from various text prompts. The resulting dataset of 15K realistic masked face images of 100 celebrities is called the Realistic Synthetic Masked Face Dataset (RSMFD). The model and the generated dataset will be made public so that researchers can augment the dataset. To the best of our knowledge, this is the largest masked face recognition dataset with realistic images. The generated images were evaluated on popular deep face recognition models with significant results, and image classification models trained and tested on the dataset achieve competitive performance. The dataset is available at: https://drive.google.com/drive/folders/1yetcgUOL1TOP4rod1geGsOkIrIJHtcEw?usp=sharing
{"title":"Text-Guided Synthesis of Masked Face Images","authors":"Anjali T, Masilamani V","doi":"10.1145/3654667","DOIUrl":"https://doi.org/10.1145/3654667","url":null,"abstract":"<p>The COVID-19 pandemic has made us all understand that wearing a face mask protects us from the spread of respiratory viruses. The face authentication systems, which are trained on the basis of facial key points such as the eyes, nose, and mouth, found it difficult to identify the person when the majority of the face is covered by the face mask. Removing the mask for authentication will cause the infection to spread. The possible solutions are: (a) to train the face recognition systems to identify the person with the upper face features (b) Reconstruct the complete face of the person with a generative model. (c) train the model with a dataset of the masked faces of the people. In this paper, we explore the scope of generative models for image synthesis. We used stable diffusion to generate masked face images of popular celebrities on various text prompts. A realistic dataset of 15K masked face images of 100 celebrities is generated and is called the Realistic Synthetic Masked Face Dataset (RSMFD). The model and the generated dataset will be made public so that researchers can augment the dataset. According to our knowledge, this is the largest masked face recognition dataset with realistic images. The generated images were tested on popular deep face recognition models and achieved significant results. The dataset is also trained and tested on some of the famous image classification models, and the results are competitive. The dataset is available on this link:- https://drive.google.com/drive/folders/1yetcgUOL1TOP4rod1geGsOkIrIJHtcEw?usp=sharing\u0000</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"1 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vehicle re-identification (v-reID) is a crucial and challenging task in intelligent transportation systems (ITS). While vehicle re-identification plays a role in analysing traffic behaviour, criminal investigation, and automatic toll collection, it is also a key component in the construction of smart cities. With the recent introduction of transformer models and their rapid development in computer vision, vehicle re-identification has also made significant progress in performance and development over 2021-2023. This bite-sized review is the first to summarize existing works in vehicle re-identification using pure transformer models and examine their capabilities. We introduce the various applications and challenges, different datasets, evaluation strategies, and loss functions in v-reID. A comparison between existing state-of-the-art methods from different research areas is then provided. Finally, we discuss possible future research directions and provide a checklist on how to implement a v-reID model. This checklist is useful for researchers or practitioners starting their work in this field, and for anyone seeking insight into how to implement an AI model in computer vision using v-reID.
{"title":"Paying Attention to Vehicles: A Systematic Review on Transformer-Based Vehicle Re-Identification","authors":"Yan Qian, Johan Barthélemy, Bo Du, Jun Shen","doi":"10.1145/3655623","DOIUrl":"https://doi.org/10.1145/3655623","url":null,"abstract":"<p>Vehicle re-identification (v-reID) is a crucial and challenging task in the intelligent transportation systems (ITS). While vehicle re-identification plays a role in analysing traffic behaviour, criminal investigation, or automatic toll collection, it is also a key component for the construction of smart cities. With the recent introduction of transformer models and their rapid development in computer vision, vehicle re-identification has also made significant progress in performance and development over 2021-2023. This bite-sized review is the first to summarize existing works in vehicle re-identification using pure transformer models and examine their capabilities. We introduce the various applications and challenges, different datasets, evaluation strategies and loss functions in v-reID. A comparison between existing state-of-the-art methods based on different research areas is then provided. Finally, we discuss possible future research directions and provide a checklist on how to implement a v-reID model. This checklist is useful for an interested researcher or practitioner who is starting their work in this field, and also for anyone who seeks an insight into how to implement an AI model in computer vision using v-reID.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"18 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140575289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tingting Han, Quan Zhou, Jun Yu, Zhou Yu, Jianhui Zhang, Sicheng Zhao
Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information that cannot be captured by frame-level video representations. In this paper, we propose the Parameter-free Motion Attention Module (PMAM) to exploit the crucial motion clues potentially contained in adjacent video frames, using a multi-head attention architecture. The PMAM requires no additional trainable parameters, leading to an efficient and effective understanding of video dynamics. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), integrating the parameter-free motion attention module with local and global multi-head attention based on object-centric and scene-centric video representations. The synergistic combination of local motion information, extracted by the proposed PMAM, with long-range interactions modeled by the local and global multi-head attention mechanism significantly enhances the performance of video summarization. Extensive experimental results on the benchmark datasets SumMe and TVSum demonstrate that the proposed MMAN outperforms other state-of-the-art methods, yielding remarkable performance gains.
{"title":"Effective Video Summarization by Extracting Parameter-free Motion Attention","authors":"Tingting Han, Quan Zhou, Jun Yu, Zhou Yu, Jianhui Zhang, Sicheng Zhao","doi":"10.1145/3654670","DOIUrl":"https://doi.org/10.1145/3654670","url":null,"abstract":"<p>Video summarization remains a challenging task despite increasing research efforts. Traditional methods focus solely on long-range temporal modeling of video frames, overlooking important local motion information which can not be captured by frame-level video representations. In this paper, we propose the Parameter-free Motion Attention Module (PMAM) to exploit the crucial motion clues potentially contained in adjacent video frames, using a multi-head attention architecture. The PMAM requires no additional training for model parameters, leading to an efficient and effective understanding of video dynamics. Moreover, we introduce the Multi-feature Motion Attention Network (MMAN), integrating the parameter-free motion attention module with local and global multi-head attention based on object-centric and scene-centric video representations. The synergistic combination of local motion information, extracted by the proposed PMAM, with long-range interactions modeled by the local and global multi-head attention mechanism, can significantly enhance the performance of video summarization. Extensive experimental results on the benchmark datasets, SumMe and TVSum, demonstrate that the proposed MMAN outperforms other state-of-the-art methods, resulting in remarkable performance gains.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"2015 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140574969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo
Facial expression generation is one of the most challenging and long-sought goals of character animation, with many interesting applications. The task has traditionally relied heavily on digital craftspersons and remains largely unexplored by automated approaches. In this paper, we introduce a generative framework for producing 3D facial expression sequences (i.e., 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) learning a generative model over a set of 3D landmark sequences, and (2) generating 3D mesh sequences for an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks in other domains. While it can be trained unconditionally, its reverse process can still be conditioned on various signals. This allows us to efficiently develop several downstream tasks involving various forms of conditional generation, using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder that applies the geometric deformation embedded in the landmarks to a given facial mesh. Experiments show that our model learns to generate realistic, high-quality expressions solely from a dataset of relatively small size, improving over state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.
{"title":"4D Facial Expression Diffusion Model","authors":"Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo","doi":"10.1145/3653455","DOIUrl":"https://doi.org/10.1145/3653455","url":null,"abstract":"<p>Facial expression generation is one of the most challenging and long-sought aspects of character animation, with many interesting applications. The challenging task, traditionally having relied heavily on digital craftspersons, remains yet to be explored. In this paper, we introduce a generative framework for generating 3D facial expression sequences (i.e. 4D faces) that can be conditioned on different inputs to animate an arbitrary 3D face mesh. It is composed of two tasks: (1) Learning the generative model that is trained over a set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input facial mesh driven by the generated landmark sequences. The generative model is based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved remarkable success in generative tasks of other domains. While it can be trained unconditionally, its reverse process can still be conditioned by various condition signals. This allows us to efficiently develop several downstream tasks involving various conditional generation, by using expression labels, text, partial sequences, or simply a facial geometry. To obtain the full mesh deformation, we then develop a landmark-guided encoder-decoder to apply the geometrical deformation embedded in landmarks on a given facial mesh. Experiments show that our model has learned to generate realistic, quality expressions solely from the dataset of relatively small size, improving over the state-of-the-art methods. Videos and qualitative comparisons with other methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models will be made available upon acceptance.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"53 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140315163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-to-image synthesis aims to generate an accurate and semantically consistent image from a given text description. However, it is difficult for existing generative methods to produce semantically complete images from a single piece of text. Some works try to expand the input text into multiple captions by retrieving similar descriptions from the training set, but they still fail to fill in missing image semantics. In this paper, we propose a GAN-based approach to Imagine, Select, and Fuse for text-to-image synthesis, named ISF-GAN. The proposed ISF-GAN contains an Imagine Stage and a Select-and-Fuse Stage to solve the above problems. First, the Imagine Stage introduces a text completion and enrichment module, which guides a GPT-based model to enrich the text expression beyond the original dataset. Second, the Select-and-Fuse Stage selects qualified text descriptions and then introduces a cross-modal attention mechanism that lets these sentences interact with image features at different scales. In short, our proposed model enriches the input text to complete missing semantics and introduces a cross-modal attention mechanism to maximize the use of the enriched text for generating semantically consistent images. Experimental results on the CUB, Oxford-102, and CelebA-HQ datasets demonstrate the effectiveness and superiority of the proposed network. Code is available at https://github.com/Feilingg/ISF-GAN.
{"title":"ISF-GAN: Imagine, Select, and Fuse with GPT-Based Text Enrichment for Text-to-Image Synthesis","authors":"Yefei Sheng, Ming Tao, Jie Wang, Bing-Kun Bao","doi":"10.1145/3650033","DOIUrl":"https://doi.org/10.1145/3650033","url":null,"abstract":"<p>Text-to-Image synthesis aims to generate an accurate and semantically consistent image from a given text description. However, it is difficult for existing generative methods to generate semantically complete images from a single piece of text. Some works try to expand the input text to multiple captions via retrieving similar descriptions of the input text from the training set, but still fail to fill in missing image semantics. In this paper, we propose a GAN-based approach to Imagine, Select, and Fuse for Text-to-Image synthesis, named ISF-GAN. The proposed ISF-GAN contains Imagine Stage and Select and Fuse Stage to solve the above problems. First, the Imagine Stage proposes a text completion and enrichment module. This module guides a GPT-based model to enrich the text expression beyond the original dataset. Second, the Select and Fuse Stage selects qualified text descriptions, and then introduces a cross-modal attentional mechanism to interact these different sentences with the image features at different scales. In short, our proposed model enriches the input text information for completing missing semantics and introduces a cross-modal attentional mechanism to maximize the utilization of enriched text information to generate semantically consistent images. Experimental results on CUB, Oxford-102, and CelebA-HQ datasets prove the effectiveness and superiority of the proposed network. Code is available at https://github.com/Feilingg/ISF-GAN.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"14 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140315401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shizhan Liu, Weiyao Lin, Yihang Chen, Yufeng Zhang, Wenrui Dai, John See, Hongkai Xiong
The rapid advancement of multimedia and imaging technologies has resulted in increasingly diverse visual and semantic data. A wide range of applications, such as remote-assisted driving, requires the combined storage and transmission of various visual and semantic data. However, existing works insufficiently exploit the redundancy between different types of data. In this paper, we propose a unified framework to jointly compress a diverse spectrum of visual and semantic data, including images, point clouds, segmentation maps, object attributes, and relations. We develop a unifying process that embeds the representations of these data into a joint embedding graph according to their categories, which enables flexible handling of joint compression tasks for various visual and semantic data. To fully leverage the redundancy between different data types, we further introduce an embedding-based adaptive joint encoding process and a Semantic Adaptation Module to efficiently encode diverse data based on the learned embeddings in the joint embedding graph. Experiments on the Cityscapes, MSCOCO, and KITTI datasets demonstrate the superiority of our framework, highlighting promising steps toward scalable multimedia processing.
{"title":"A Unified Framework for Jointly Compressing Visual and Semantic Data","authors":"Shizhan Liu, Weiyao Lin, Yihang Chen, Yufeng Zhang, Wenrui Dai, John See, Hongkai Xiong","doi":"10.1145/3654800","DOIUrl":"https://doi.org/10.1145/3654800","url":null,"abstract":"<p>The rapid advancement of multimedia and imaging technologies has resulted in increasingly diverse visual and semantic data. A large range of applications such as remote-assisted driving requires the amalgamated storage and transmission of various visual and semantic data. However, existing works suffer from the limitation of insufficiently exploiting the redundancy between different types of data. In this paper, we propose a unified framework to jointly compress a diverse spectrum of visual and semantic data, including images, point clouds, segmentation maps, object attributes and relations. We develop a unifying process that embeds the representations of these data into a joint embedding graph according to their categories, which enables flexible handling of joint compression tasks for various visual and semantic data. To fully leverage the redundancy between different data types, we further introduce an embedding-based adaptive joint encoding process and a Semantic Adaptation Module to efficiently encode diverse data based on the learned embeddings in the joint embedding graph. Experiments on the Cityscapes, MSCOCO, and KITTI datasets demonstrate the superiority of our framework, highlighting promising steps toward scalable multimedia processing.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"197 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140315162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junjian Huang, Hao Ren, Shulin Liu, Yong Liu, Chuanlu Lv, Jiawen Lu, Changyong Xie, Hong Lu
Images taken under low-light conditions suffer from poor visibility, color distortion, and graininess, all of which degrade image quality and hamper downstream vision tasks such as object detection and instance segmentation in autonomous driving. This makes low-light enhancement an indispensable component of high-level visual tasks. Low-light enhancement aims to mitigate these issues and has garnered extensive attention and research over several decades. The primary challenge in low-light image enhancement arises from the low signal-to-noise ratio (SNR) caused by insufficient lighting. This challenge becomes even more pronounced in near-zero lux conditions, where noise overwhelms the available image information. Both the traditional image signal processing (ISP) pipeline and conventional low-light image enhancement methods struggle in such scenarios. Recently, deep neural networks have been used to address this challenge. These networks take unmodified RAW images as input and produce enhanced sRGB images, forming a deep learning-based ISP pipeline. However, most of these networks are computationally expensive and thus far from practical use. In this paper, we propose a lightweight model called the attentive dilated U-Net (ADU-Net) to tackle this issue. Our model incorporates several innovative designs, including an asymmetric U-shape architecture, dilated residual modules (DRMs) for feature extraction, and attentive fusion modules (AFMs) for feature fusion. The DRMs provide strong representative capability, while the AFMs effectively leverage low-level texture information and high-level semantic information within the network. Both modules employ a lightweight design but offer significant performance gains. Extensive experiments demonstrate that our method is highly effective, achieving an excellent balance between image quality and computational complexity: it takes less than 4 ms for a high-definition 4K image on a single GTX 1080Ti GPU while maintaining competitive visual quality. Furthermore, our method exhibits pleasing scalability and generalizability, highlighting its potential for widespread applicability.
{"title":"Real-time Attentive Dilated U-Net for Extremely Dark Image Enhancement","authors":"Junjian Huang, Hao Ren, Shulin Liu, Yong Liu, Chuanlu Lv, Jiawen Lu, Changyong Xie, Hong Lu","doi":"10.1145/3654668","DOIUrl":"https://doi.org/10.1145/3654668","url":null,"abstract":"<p>Images taken under low-light conditions suffer from poor visibility, color distortion and graininess, all of which degrade the image quality and hamper the performance of downstream vision tasks, such as object detection and instance segmentation in the field of autonomous driving, making low-light enhancement an indispensable basic component of high-level visual tasks. Low-light enhancement aims to mitigate these issues, and has garnered extensive attention and research over several decades. The primary challenge in low-light image enhancement arises from the low signal-to-noise ratio (SNR) caused by insufficient lighting. This challenge becomes even more pronounced in near-zero lux conditions, where noise overwhelms the available image information. Both traditional image signal processing (ISP) pipeline and conventional low-light image enhancement methods struggle in such scenarios. Recently, deep neural networks have been used to address this challenge. These networks take unmodified RAW images as input and produce the enhanced sRGB images, forming a deep learning-based ISP pipeline. However, most of these networks are computationally expensive and thus far from practical use. In this paper, we propose a lightweight model called attentive dilated U-Net (ADU-Net) to tackle this issue. Our model incorporates several innovative designs, including an asymmetric U-shape architecture, dilated residual modules (DRMs) for feature extraction, and attentive fusion modules (AFMs) for feature fusion. The DRMs provide strong representative capability while the AFMs effectively leverage low-level texture information and high-level semantic information within the network. Both modules employ a lightweight design but offer significant performance gains. Extensive experiments demonstrate our method is highly-effective, achieving an excellent balance between image quality and computational complexity, <i>i</i>.<i>e</i>., taking less than 4ms for a high-definition 4K image on a single GTX 1080Ti GPU and yet maintaining competitive visual quality. Furthermore, our method exhibits pleasing scalability and generalizability, highlighting its potential for widespread applicability.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"52 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140297581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Baoli Sun, Xinchen Ye, Tiantian Yan, Zhihui Wang, Haojie Li, Zhiyong Wang
Fine-grained video action recognition aims to identify minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, even though it is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. First, we propose a hierarchic correlation reasoning (HCR) module that explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting its correlations with other segments. Second, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video to boost final recognition. Extensive experimental results on two fine-grained action recognition datasets, i.e., FineGym and Diving48, and two action recognition datasets, i.e., Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with state-of-the-art methods.
{"title":"Discriminative Segment Focus Network for Fine-grained Video Action Recognition","authors":"Baoli Sun, Xinchen Ye, Tiantian Yan, Zhihui Wang, Haojie Li, Zhiyong Wang","doi":"10.1145/3654671","DOIUrl":"https://doi.org/10.1145/3654671","url":null,"abstract":"<p>Fine-grained video action recognition aims to identify minor and discriminative variations among fine categories of actions. While many recent action recognition methods have been proposed to better model spatio-temporal representations, how to model the interactions among discriminative atomic actions to effectively characterize inter-class and intra-class variations has been neglected, which is vital for understanding fine-grained actions. In this work, we devise a Discriminative Segment Focus Network (DSFNet) to mine the discriminability of segment correlations and localize discriminative action-relevant segments for fine-grained video action recognition. Firstly, we propose a hierarchic correlation reasoning (HCR) module which explicitly establishes correlations between different segments at multiple temporal scales and enhances each segment by exploiting the correlations with other segments. Secondly, a discriminative segment focus (DSF) module is devised to localize the most action-relevant segments from the enhanced representations of HCR by enforcing the consistency between the discriminability and the classification confidence of a given segment with a consistency constraint. Finally, these localized segment representations are combined with the global action representation of the whole video for boosting final recognition. Extensive experimental results on two fine-grained action recognition datasets, <i>i.e.</i>, FineGym and Diving48, and two action recognition datasets, <i>i.e.</i>, Kinetics400 and Something-Something, demonstrate the effectiveness of our approach compared with the state-of-the-art methods.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"55 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140297509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Once a video sequence is organized into basic shot units, it is of great interest to temporally link shots into semantically compact scene segments to facilitate long video understanding. However, existing video scene boundary detection methods still struggle to handle the varied visual semantics and complex shot relations found in video scenes. We propose a novel self-supervised learning method, Video Scene Montage for Boundary Detection (VSMBD), to extract rich shot semantics and learn shot relations using unlabeled videos. More specifically, we present Video Scene Montage (VSM) to synthesize reliable pseudo scene boundaries, which learns task-related semantic relations between shots in a self-supervised manner. To lay a solid foundation for modeling semantic relations between shots, we decouple the visual semantics of shots into foreground and background. Instead of costly learning from scratch, as in most previous self-supervised learning methods, we build our model upon large-scale pre-trained visual encoders to extract the foreground and background features. Experimental results demonstrate that VSMBD trains a model with a strong capability for capturing shot relations, surpassing previous methods by significant margins. The code is available at https://github.com/mini-mind/VSMBD.
{"title":"Temporal Scene Montage for Self-Supervised Video Scene Boundary Detection","authors":"Jiawei Tan, Pingan Yang, Lu Chen, Hongxing Wang","doi":"10.1145/3654669","DOIUrl":"https://doi.org/10.1145/3654669","url":null,"abstract":"<p>Once a video sequence is organized as basic shot units, it is of great interest to temporally link shots into semantic-compact scene segments to facilitate long video understanding. However, it still challenges existing video scene boundary detection methods to handle various visual semantics and complex shot relations in video scenes. We proposed a novel self-supervised learning method, Video Scene Montage for Boundary Detection (VSMBD), to extract rich shot semantics and learn shot relations using unlabeled videos. More specifically, we present Video Scene Montage (VSM) to synthesize reliable pseudo scene boundaries, which learns task-related semantic relations between shots in a self-supervised manner. To lay a solid foundation for modeling semantic relations between shots, we decouple visual semantics of shots into foreground and background. Instead of costly learning from scratch as in most previous self-supervised learning methods, we build our model upon large-scale pre-trained visual encoders to extract the foreground and background features. Experimental results demonstrate VSMBD trains a model with strong capability in capturing shot relations, surpassing previous methods by significant margins. The code is available at https://github.com/mini-mind/VSMBD.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"22 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140297698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hao Zhang, Meng Liu, Yuan Qi, Yang Ning, Shunbo Hu, Liqiang Nie, Wenyin Zhang
Accurate and automated segmentation of lesions in brain MRI scans is crucial in diagnostics and treatment planning. Despite the significant achievements of existing approaches, they often require substantial computational resources and fail to fully exploit the synergy between low-level and high-level features. To address these challenges, we introduce the Separable Spatial Convolutional Network (SSCN), an innovative model that refines the U-Net architecture to achieve efficient brain tumor segmentation with minimal computational cost. SSCN integrates the PocketNet paradigm and replaces standard convolutions with depthwise separable convolutions, resulting in a significant reduction in parameters and computational load. Additionally, our feature complementary module enhances the interaction between features across the encoder-decoder structure, facilitating the integration of multi-scale features while maintaining low computational demands. The model also incorporates a separable spatial attention mechanism, enhancing its capability to discern spatial details. Empirical validations on standard datasets demonstrate the effectiveness of our proposed model, especially in segmenting small and medium-sized tumors, with only 0.27M parameters and 3.68 GFLOPs. Our code is available at https://github.com/zzpr/SSCN.
{"title":"Efficient Brain Tumor Segmentation with Lightweight Separable Spatial Convolutional Network","authors":"Hao Zhang, Meng Liu, Yuan Qi, Yang Ning, Shunbo Hu, Liqiang Nie, Wenyin Zhang","doi":"10.1145/3653715","DOIUrl":"https://doi.org/10.1145/3653715","url":null,"abstract":"<p>Accurate and automated segmentation of lesions in brain MRI scans is crucial in diagnostics and treatment planning. Despite the significant achievements of existing approaches, they often require substantial computational resources and fail to fully exploit the synergy between low-level and high-level features. To address these challenges, we introduce the Separable Spatial Convolutional Network (SSCN), an innovative model that refines the U-Net architecture to achieve efficient brain tumor segmentation with minimal computational cost. SSCN integrates the PocketNet paradigm and replaces standard convolutions with depthwise separable convolutions, resulting in a significant reduction in parameters and computational load. Additionally, our feature complementary module enhances the interaction between features across the encoder-decoder structure, facilitating the integration of multi-scale features while maintaining low computational demands. The model also incorporates a separable spatial attention mechanism, enhancing its capability to discern spatial details. Empirical validations on standard datasets demonstrate the effectiveness of our proposed model, especially in segmenting small and medium-sized tumors, with only 0.27M parameters and 3.68GFlops. Our code is available at https://github.com/zzpr/SSCN.</p>","PeriodicalId":50937,"journal":{"name":"ACM Transactions on Multimedia Computing Communications and Applications","volume":"103 1","pages":""},"PeriodicalIF":5.1,"publicationDate":"2024-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140203638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}