Accurate entropy modeling in learned image compression with joint enhanced SwinT and CNN
Dongjian Yang, Xiaopeng Fan, Xiandong Meng, Debin Zhao
Pub Date: 2024-07-07 | DOI: 10.1007/s00530-024-01405-w
Recently, learned image compression (LIC) has shown significant research potential. Most existing LIC methods are CNN-based, transformer-based, or a mixture of the two. However, these methods all suffer from some degradation in global modeling, because CNNs have convolution kernels of limited size while transformers apply window partitioning to reduce computational complexity. This gives rise to two issues: (1) the main autoencoder (AE) and hyper AE exhibit limited transformation capability due to insufficient global modeling, making it difficult to improve the accuracy of the coarse-grained entropy model; (2) the fine-grained entropy model struggles to adaptively exploit a larger range of contexts because of its weaker global modeling capability. In this paper, we propose an LIC method with a jointly enhanced Swin Transformer (SwinT) and CNN to improve entropy modeling accuracy. The key idea is to enhance the global modeling ability of SwinT by introducing neighborhood window attention, while keeping the computational complexity acceptable, and to combine it with the local modeling ability of CNN to form the enhanced SwinT and CNN block (ESTCB). Specifically, we rebuild the main AE and hyper AE of LIC on ESTCB, enhancing their global transformation capability and yielding a more accurate coarse-grained entropy model. In addition, we combine ESTCB with a checkerboard mask and a channel autoregressive model to develop a spatial-then-channel fine-grained entropy model, expanding the range of contexts that the LIC model can adaptively reference. Comprehensive experiments demonstrate that the proposed method achieves state-of-the-art rate-distortion performance compared with existing LIC models.
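As a rough illustration of the spatial half of such a spatial-then-channel context model, the sketch below implements a checkerboard-masked convolution: each latent position gathers context only from the opposite-parity (already decoded) half of the lattice. The layer sizes and channel counts are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CheckerboardContext(nn.Conv2d):
    """5x5 convolution whose kernel only sees the opposite-parity (anchor) positions."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        mask = torch.zeros_like(self.weight)
        for i in range(kernel_size):
            for j in range(kernel_size):
                if (i + j) % 2 == 1:          # keep taps on the decoded (anchor) half only
                    mask[:, :, i, j] = 1.0    # the centre (i + j even) stays masked out
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias, padding=self.padding)

y_hat = torch.randn(1, 192, 16, 16)            # toy quantized latent
ctx = CheckerboardContext(192, 384)(y_hat)     # spatial context for the non-anchor half
print(ctx.shape)                               # torch.Size([1, 384, 16, 16])
```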
{"title":"Accurate entropy modeling in learned image compression with joint enchanced SwinT and CNN","authors":"Dongjian Yang, Xiaopeng Fan, Xiandong Meng, Debin Zhao","doi":"10.1007/s00530-024-01405-w","DOIUrl":"https://doi.org/10.1007/s00530-024-01405-w","url":null,"abstract":"<p>Recently, learned image compression (LIC) has shown significant research potential. Most existing LIC methods are CNN-based or transformer-based or mixed. However, these LIC methods suffer from a certain degree of degradation in global attention performance, as CNN has limited-sized convolution kernels while window partitioning is applied to reduce computational complexity in transformer. This gives rise to the following two issues: (1) The main autoencoder (AE) and hyper AE exhibit limited transformation capabilities due to insufficient global modeling, making it challenging to improve the accuracy of coarse-grained entropy model. (2) The fine-grained entropy model struggles to adaptively utilize a larger range of contexts, because of weaker global modeling capability. In this paper, we propose the LIC with joint enhanced swin transformer (SwinT) and CNN to improve the entropy modeling accuracy. The key in the proposed method is that we enhance the global modeling ability of SwinT by introducing neighborhood window attention while maintaining an acceptable computational complexity and combines the local modeling ability of CNN to form the enhanced SwinT and CNN block (ESTCB). Specifically, we reconstruct the main AE and hyper AE of LIC based on ESTCB, enhancing their global transformation capabilities and resulting in a more accurate coarse-grained entropy model. Besides, we combine ESTCB with the checkerboard mask and the channel autoregressive model to develop a spatial then channel fine-grained entropy model, expanding the scope of LIC adaptive reference contexts. Comprehensive experiments demonstrate that our proposed method achieves state-of-the-art rate-distortion performance compared to existing LIC models.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"08 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A multi-scale no-reference video quality assessment method based on transformer
Yingan Cui, Zonghua Yu, Yuqin Feng, Huaijun Wang, Junhuai Li
Pub Date: 2024-07-06 | DOI: 10.1007/s00530-024-01403-y
Video quality assessment is essential for optimizing user experience, enhancing network efficiency, supporting video production and editing, improving advertising effectiveness, and strengthening security in monitoring and other domains. Current research tends to focus on video detail distortion while overlooking the temporal relationships between video frames and the impact of the content-dependent characteristics of the human visual system on video quality. In response, this paper proposes a multi-scale no-reference video quality assessment method based on the transformer. On the one hand, spatial features of the video are extracted using a network that combines the Swin Transformer and deformable convolution, and further information preservation is achieved through mixed pooling of the features in video frames. On the other hand, a pyramid aggregation module is used to merge long-term and short-term memories, enhancing the ability to capture temporal changes. Experimental results on the public KoNViD-1k, CVD2014, and LIVE-VQC datasets demonstrate the effectiveness of the proposed method in video quality prediction.
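The "mixed pooling" step mentioned above can be pictured as blending average- and max-pooled frame features before temporal modeling. The sketch below is a minimal version of that idea; the fixed blend weight alpha is an illustrative assumption, not the paper's design.

```python
import torch
import torch.nn as nn

class MixedPool(nn.Module):
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.max = nn.AdaptiveMaxPool2d(1)

    def forward(self, feats):                  # feats: (batch*frames, C, H, W)
        pooled = self.alpha * self.avg(feats) + (1 - self.alpha) * self.max(feats)
        return pooled.flatten(1)               # (batch*frames, C)

frame_feats = torch.randn(8, 768, 7, 7)        # e.g. Swin-style features for 8 frames
print(MixedPool()(frame_feats).shape)          # torch.Size([8, 768])
```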
{"title":"A multi-scale no-reference video quality assessment method based on transformer","authors":"Yingan Cui, Zonghua Yu, Yuqin Feng, Huaijun Wang, Junhuai Li","doi":"10.1007/s00530-024-01403-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01403-y","url":null,"abstract":"<p>Video quality assessment is essential for optimizing user experience, enhancing network efficiency, supporting video production and editing, improving advertising effectiveness, and strengthening security in monitoring and other domains. Reacting to the prevailing focus of current research on video detail distortion while overlooking the temporal relationships between video frames and the impact of content-dependent characteristics of the human visual system on video quality, this paper proposes a multi-scale no-reference video quality assessment method based on transformer. On the one hand, spatial features of the video are extracted using a network that combines swin-transformer and deformable convolution, and further information preservation is achieved through mixed pooling of features in video frames. On the other hand, a pyramid aggregation module is utilized to merge long-term and short-term memories, enhancing the ability to capture temporal changes. Experimental results on public datasets such as KoNViD-1k, CVD2014, and LIVE-VQC demonstrate the effectiveness of the proposed method in video quality prediction.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"62 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning based features extraction for facial gender classification using ensemble of machine learning technique
Fazal Waris, Feipeng Da, Shanghuan Liu
Pub Date: 2024-07-06 | DOI: 10.1007/s00530-024-01399-5
Accurate and efficient gender recognition is essential for many applications such as surveillance, security, and biometrics. Recently, deep learning techniques have made remarkable advances in feature extraction and have been widely adopted in various applications, including gender classification. However, despite the numerous studies on the problem, extracting robust, essential features from face images and distinguishing them efficiently and accurately in the wild remains a challenging task for real-world applications. This article proposes an approach that combines deep learning with a soft-voting ensemble model to perform automatic gender classification with high accuracy in unconstrained environments. In the proposed technique, a novel deep convolutional neural network (DCNN) is designed to extract 128 high-quality features from face images. These extracted features are then pre-processed with the StandardScaler method and finally classified by a soft-voting ensemble that combines the outputs of several machine learning classifiers, namely random forest (RF), support vector machine (SVM), linear discriminant analysis (LDA), logistic regression (LR), gradient boosting classifier (GBC), and XGBoost, to improve prediction accuracy. The experimental study was performed on the UTK, Labeled Faces in the Wild (LFW), Adience, and FEI datasets. The results show that the proposed approach outperforms all current approaches in terms of accuracy across all datasets.
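The classification stage described above maps naturally onto scikit-learn. The sketch below standardizes synthetic stand-ins for the 128-dimensional DCNN features and classifies them with a soft-voting ensemble; XGBoost is omitted to keep the dependencies to scikit-learn alone, and all hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=128, random_state=0)  # stand-in for DCNN face features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = make_pipeline(
    StandardScaler(),
    VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("svm", SVC(probability=True, random_state=0)),   # soft voting needs class probabilities
            ("lda", LinearDiscriminantAnalysis()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("gbc", GradientBoostingClassifier(random_state=0)),
        ],
        voting="soft",
    ),
)
ensemble.fit(X_tr, y_tr)
print("held-out accuracy:", ensemble.score(X_te, y_te))
```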
{"title":"Deep learning based features extraction for facial gender classification using ensemble of machine learning technique","authors":"Fazal Waris, Feipeng Da, Shanghuan Liu","doi":"10.1007/s00530-024-01399-5","DOIUrl":"https://doi.org/10.1007/s00530-024-01399-5","url":null,"abstract":"<p>Accurate and efficient gender recognition is an essential for many applications such as surveillance, security, and biometrics. Recently, deep learning techniques have made remarkable advancements in feature extraction and have become extensively implemented in various applications, including gender classification. However, despite the numerous studies conducted on the problem, correctly recognizing robust and essential features from face images and efficiently distinguishing them with high accuracy in the wild is still a challenging task for real-world applications. This article proposes an approach that combines deep learning and soft voting-based ensemble model to perform automatic gender classification with high accuracy in an unconstrained environment. In the proposed technique, a novel deep convolutional neural network (DCNN) was designed to extract 128 high-quality and accurate features from face images. The StandardScaler method was then used to pre-process these extracted features, and finally, these preprocessed features were classified with soft voting ensemble learning model combining the outputs from several machine learning classifiers such as random forest (RF), support vector machine (SVM), linear discriminant analysis (LDA), logistic regression (LR), gradient boosting classifier (GBC) and XGBoost to improve the prediction accuracy. The experimental study was performed on the UTK, label faces in the wild (LFW), Adience and FEI datasets. The results attained evidently show that the proposed approach outperforms all current approaches in terms of accuracy across all datasets.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"367 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese
Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen
Pub Date: 2024-07-06 | DOI: 10.1007/s00530-024-01394-w
In recent years, visual question answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English. To address this, we introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), with each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on questions. The architecture effectively employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early model stage. The experimental findings demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics.
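A minimal sketch of the early-fusion idea described above: text-token and image-patch embeddings are concatenated and reasoned over jointly by a single transformer encoder. All dimensions, the vocabulary size, and the answer head are illustrative assumptions, not PhoVIT's actual configuration.

```python
import torch
import torch.nn as nn

class EarlyFusionVQA(nn.Module):
    def __init__(self, vocab=10000, d=256, n_answers=500):
        super().__init__()
        self.txt_emb = nn.Embedding(vocab, d)
        self.img_proj = nn.Linear(768, d)               # project ViT-style patch features
        self.modality = nn.Embedding(2, d)              # 0 = text, 1 = image
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.head = nn.Linear(d, n_answers)

    def forward(self, question_ids, patch_feats):
        t = self.txt_emb(question_ids) + self.modality(torch.zeros_like(question_ids))
        v = self.img_proj(patch_feats) + self.modality.weight[1]
        fused = self.encoder(torch.cat([t, v], dim=1))  # joint reasoning over both modalities
        return self.head(fused.mean(dim=1))             # pooled answer logits

logits = EarlyFusionVQA()(torch.randint(0, 10000, (2, 20)), torch.randn(2, 49, 768))
print(logits.shape)                                     # torch.Size([2, 500])
```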
{"title":"ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese","authors":"Khiem Vinh Tran, Hao Phu Phan, Kiet Van Nguyen, Ngan Luu Thuy Nguyen","doi":"10.1007/s00530-024-01394-w","DOIUrl":"https://doi.org/10.1007/s00530-024-01394-w","url":null,"abstract":"<p>In recent years, visual question answering (VQA) has gained significant attention for its diverse applications, including intelligent car assistance, aiding visually impaired individuals, and document image information retrieval using natural language queries. VQA requires effective integration of information from questions and images to generate accurate answers. Neural models for VQA have made remarkable progress on large-scale datasets, with a primary focus on resource-rich languages like English. To address this, we introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese while mitigating biases. The dataset comprises over 26,000 images and 30,000 question-answer pairs (QAs), each question annotated to specify the type of reasoning involved. Leveraging this dataset, we conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations. Furthermore, we present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions. The architecture effectively employs transformers to enable simultaneous reasoning over textual and visual data, merging both modalities at an early model stage. The experimental findings demonstrate that our proposed model achieves state-of-the-art performance across four evaluation metrics.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"73 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blind quality evaluator for multi-exposure fusion image via joint sparse features and complex-wavelet statistical characteristics
Benquan Yang, Yueli Cui, Lihong Liu, Guang Chen, Jiamin Xu, Junhao Lin
Pub Date: 2024-07-05 | DOI: 10.1007/s00530-024-01404-x
The multi-exposure fusion (MEF) technique aims to fuse multiple images taken of the same scene at different exposure levels into a single image with more detail. Although more and more MEF algorithms have been developed, how to effectively evaluate the quality of MEF images has not been thoroughly investigated. To address this issue, a blind quality evaluator for MEF images via joint sparse features and complex-wavelet statistical characteristics is developed. Specifically, considering that color and structure distortions are inevitably introduced during MEF operations, we first train a color dictionary in the Lab color space based on the color perception mechanism of the human visual system and extract sparse perceptual features to capture the color and structure distortions. Given an MEF image to be evaluated, its components in both the luminance and color channels are derived first. These components are then sparsely encoded with the trained color dictionary, and perceptual sparse features are extracted from the resulting sparse coefficients. In addition, considering the insensitivity of sparse features to weak structural information in images, complex steerable pyramid decomposition is further performed on the generated chromaticity map. Perceptual features of magnitude, phase, and a cross-scale structural similarity index are then extracted from the complex wavelet coefficients of the chromaticity map as quality-aware features. Experimental results demonstrate that the proposed metric outperforms existing classic image quality evaluation metrics while maintaining high accordance with human visual perception.
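The sparse-feature step can be sketched with scikit-learn's dictionary-learning tools: patches (here random stand-ins for flattened Lab patches) are encoded over a learned dictionary, and simple statistics of the coefficients serve as quality-aware features. Patch size, dictionary size, and sparsity level are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

rng = np.random.default_rng(0)
train_patches = rng.standard_normal((2000, 75))    # stand-in for flattened 5x5x3 Lab patches

dico = MiniBatchDictionaryLearning(n_components=128, alpha=1.0, random_state=0)
D = dico.fit(train_patches).components_            # the learned "color dictionary"

test_patches = rng.standard_normal((500, 75))      # patches from the MEF image under test
codes = sparse_encode(test_patches, D, algorithm="omp", n_nonzero_coefs=5)

# simple quality-aware statistics pooled from the sparse coefficients
features = np.concatenate([np.abs(codes).mean(axis=0), (codes != 0).mean(axis=0)])
print(features.shape)                              # (256,)
```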
{"title":"Blind quality evaluator for multi-exposure fusion image via joint sparse features and complex-wavelet statistical characteristics","authors":"Benquan Yang, Yueli Cui, Lihong Liu, Guang Chen, Jiamin Xu, Junhao Lin","doi":"10.1007/s00530-024-01404-x","DOIUrl":"https://doi.org/10.1007/s00530-024-01404-x","url":null,"abstract":"<p>Multi-Exposure Fusion (MEF) technique aims to fuse multiple images taken from the same scene at different exposure levels into an image with more details. Although more and more MEF algorithms have been developed, how to effectively evaluate the quality of MEF images has not been thoroughly investigated. To address this issue, a blind quality evaluator for MEF image via joint sparse features and complex-wavelet statistical characteristics is developed. Specifically, considering that color and structure distortions are inevitably introduced during the MEF operations, we first train a color dictionary in the Lab color space based on the color perception mechanism of human visual system, and extract sparse perceptual features to capture the color and structure distortions. Given an MEF image to be evaluated, its components in both luminance and color channels are derived first. Subsequently, these obtained components are sparsely encoded using the trained color dictionary, and the perceived sparse features are extracted from the derived sparse coefficients. In addition, considering the insensitivity of sparse features towards weak structural information in images, complex steerable pyramid decomposition is further performed over the generated chromaticity map. Consequently, perceptual features of magnitude, phase and cross-scale structural similarity index are extracted from complex wavelet coefficients within the chromaticity map as quality-aware features. Experimental results demonstrate that our proposed metric outperforms the existing classic image quality evaluation metrics while maintaining high accordance with human visual perception.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"86 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141551902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context-aware adaptive network for UDA semantic segmentation
Yu Yuan, Jinlong Shi, Xin Shu, Qiang Qian, Yunna Song, Zhen Ou, Dan Xu, Xin Zuo, YueCheng Yu, Yunhan Sun
Pub Date: 2024-07-05 | DOI: 10.1007/s00530-024-01397-7
Unsupervised Domain Adaptation (UDA) plays a pivotal role in enhancing the segmentation performance of models in the target domain by mitigating the domain shift between the source and target domains. However, existing UDA image-mix methods often overlook the contextual associations between classes, limiting the segmentation capability of the model. To address this issue, we propose a context-aware adaptive network that enhances the model's perception of contextual association information and maintains the contextual associations between different classes in mixed images, thereby improving the adaptability of the model. First, we design an image-mix strategy based on dynamic class correlation, called DCCMix, which constructs class-correlation meta groups to preserve the contextual associations between different classes. At the same time, DCCMix dynamically adjusts the proportion of source-domain classes within the mixed domain to gradually align with the distribution of the target domain, thereby improving training effectiveness. Second, a feature-wise fusion module and a contextual feature-aware module are designed to better perceive the contextual information of images and alleviate information loss during feature extraction. Finally, we propose an adaptive class-edge weight to strengthen the model's segmentation of edge pixels. Experimental results demonstrate that the proposed method achieves mIoU of 63.2% and 69.8% on the two UDA benchmark tasks SYNTHIA → Cityscapes and GTA → Cityscapes, respectively. The code is available at https://github.com/yuheyuan/CAAN.
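A bare-bones sketch of the class-wise image mixing that DCCMix builds on: pixels belonging to a sampled group of source-domain classes are pasted onto a target-domain image and its pseudo-label map. The correlated class groups and the dynamic class proportion are the paper's contributions and are only represented here by a hand-picked `group` list, an assumption for illustration.

```python
import torch

def class_mix(src_img, src_lbl, tgt_img, tgt_pseudo_lbl, group):
    """Paste source pixels whose label is in `group` onto the target image and pseudo-labels."""
    mask = torch.zeros_like(src_lbl, dtype=torch.bool)
    for cls in group:
        mask |= src_lbl == cls
    mixed_img = torch.where(mask.unsqueeze(0), src_img, tgt_img)   # broadcast mask over channels
    mixed_lbl = torch.where(mask, src_lbl, tgt_pseudo_lbl)
    return mixed_img, mixed_lbl

src = torch.rand(3, 512, 512)
tgt = torch.rand(3, 512, 512)
src_lbl = torch.randint(0, 19, (512, 512))
tgt_pseudo = torch.randint(0, 19, (512, 512))                      # e.g. teacher-model pseudo-labels
img, lbl = class_mix(src, src_lbl, tgt, tgt_pseudo, group=[5, 6, 7])  # a hypothetical correlated class group
print(img.shape, lbl.shape)
```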
{"title":"Context-aware adaptive network for UDA semantic segmentation","authors":"Yu Yuan, Jinlong Shi, Xin Shu, Qiang Qian, Yunna Song, Zhen Ou, Dan Xu, Xin Zuo, YueCheng Yu, Yunhan Sun","doi":"10.1007/s00530-024-01397-7","DOIUrl":"https://doi.org/10.1007/s00530-024-01397-7","url":null,"abstract":"<p>Unsupervised Domain Adaptation (UDA) plays a pivotal role in enhancing the segmentation performance of models in the target domain by mitigating the domain shift between the source and target domains. However, Existing UDA image mix methods often overlook the contextual association between classes, limiting the segmentation capability of the model. To address this issue, we propose the context-aware adaptive network that enhances the model’s perception of contextual association information and maintains the contextual associations between different classes in mixed images, thereby improving the adaptability of the model. Firstly, we design a image mix strategy based on dynamic class correlation called DCCMix that constructs class correlation meta groups to preserve the contextual associations between different classes. Simultaneously, DCCMix dynamically adjusts the class proportion of the source domain within the mixed domain to gradually align with the distribution of the target domain, thereby improving training effectiveness. Secondly, the feature-wise fusion module and contextual feature-aware module are designed to better perceive contextual information of images and alleviate the issue of information loss during the feature extraction. Finally, we propose an adaptive class-edge weight to strengthen the segmentation ability of edge pixels in the model. Experimental results demonstrate that our proposed method achieves the mloU of 63.2% and 69.8% on two UDA benchmark tasks: SYNTHIA <span>(rightarrow)</span> Cityscapes and GTA <span>(rightarrow)</span> Cityscapes respectively. The code is available at https://github.com/yuheyuan/CAAN.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"54 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality evaluation methods of handwritten Chinese characters: a comprehensive survey
Weiran Chen, Jiaqi Su, Weitao Song, Jialiang Xu, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu
Pub Date: 2024-07-04 | DOI: 10.1007/s00530-024-01396-8
Quality evaluation of handwritten Chinese characters aims to automatically quantify and assess handwritten Chinese characters through computer vision and machine learning technology. It is a topic of great concern for many handwriting learners and calligraphy enthusiasts. Over the past years, with the continuous development of computer technology, various new techniques have made rapid progress. Nevertheless, how to realize fast and accurate character evaluation without human intervention is still one of the most challenging tasks in artificial intelligence. In this paper, we aim to provide a comprehensive survey of existing handwritten Chinese character quality evaluation methods. Specifically, we first illustrate the research scope and background of the task. Then we outline our literature selection and analysis methodology and review a series of related concepts, including common Chinese character features, evaluation metrics, and classical machine learning models. After that, based on the adopted mechanisms and algorithms, we categorize the evaluation methods into two major groups: traditional methods and machine-learning-based methods. Representative approaches in each group are summarized, and their strengths and limitations are discussed in detail. Drawing on the 191 papers covered in this survey, we conclude with the challenges and future directions, in the hope of providing valuable insights for researchers in this field.
{"title":"Quality evaluation methods of handwritten Chinese characters: a comprehensive survey","authors":"Weiran Chen, Jiaqi Su, Weitao Song, Jialiang Xu, Guiqian Zhu, Ying Li, Yi Ji, Chunping Liu","doi":"10.1007/s00530-024-01396-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01396-8","url":null,"abstract":"<p>Quality evaluation of handwritten Chinese characters aims to automatically quantify and assess handwritten Chinese characters through computer vision and machine learning technology. It is a topic of great concern for many handwriting learners and calligraphy enthusiasts. Over the past years, with the continuous development of computer technology, various new techniques have achieved flourishing and thriving progress. Nevertheless, how to realize fast and accurate character evaluation without human intervention is still one of the most challenging tasks in artificial intelligence. In this paper, we aim to provide a comprehensive survey of the existing handwritten Chinese character quality evaluation methods. Specifically, we first illustrate the research scope and background of the task. Then we outline our literature selection and analysis methodology, and review a series of related concepts, including common Chinese character features, evaluation metrics and classical machine learning models. After that, relying on the adopted mechanism and algorithm, we categorize the evaluation methods into two major groups: traditional methods and machine-learning-based methods. Representative approaches in each group are summarized, and their strengths and limitations are discussed in detail. Based on 191 papers in this survey, we finally conclude our paper with the challenges and future directions, with the expectation to provide some valuable illuminations for researchers in this field.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"10 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141551799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep contrastive multi-view clustering with doubly enhanced commonality
Zhiyuan Yang, Changming Zhu, Zishi Li
Pub Date: 2024-07-04 | DOI: 10.1007/s00530-024-01400-1
Recently, deep multi-view clustering leveraging autoencoders has garnered significant attention due to its ability to simultaneously enhance feature learning and optimize clustering outcomes. However, existing autoencoder-based deep multi-view clustering methods tend either to overemphasize view-specific information, neglecting the information shared across views, or to place undue focus on shared information, diluting the complementary information of individual views. Given the principle that commonality resides within individuality, this paper proposes a staged training approach comprising two phases: pre-training and fine-tuning. The pre-training phase focuses on learning view-specific information, while the fine-tuning phase doubly enhances the commonality across views while preserving these specific details. Specifically, we learn and extract the specific information of each view through an autoencoder in the pre-training stage. In the fine-tuning stage, we first enhance the commonality between the independent view-specific representations through a transformer layer, and then further strengthen it through contrastive learning on the semantic labels of each view, so as to obtain more accurate clustering results.
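One common way to "enhance commonality" across views is an InfoNCE-style contrastive loss that pulls together the two views' representations of the same sample. The sketch below applies that loss to instance embeddings for brevity, whereas the paper applies contrastive learning to per-view semantic labels; the temperature and embedding sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of the same N samples from two different views."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # similarity of every cross-view pair
    targets = torch.arange(z1.size(0))        # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

view1 = torch.randn(32, 128)                  # e.g. outputs of the view-specific autoencoders
view2 = torch.randn(32, 128)
print(cross_view_contrastive(view1, view2))
```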
{"title":"Deep contrastive multi-view clustering with doubly enhanced commonality","authors":"Zhiyuan Yang, Changming Zhu, Zishi Li","doi":"10.1007/s00530-024-01400-1","DOIUrl":"https://doi.org/10.1007/s00530-024-01400-1","url":null,"abstract":"<p>Recently, deep multi-view clustering leveraging autoencoders has garnered significant attention due to its ability to simultaneously enhance feature learning capabilities and optimize clustering outcomes. However, existing autoencoder-based deep multi-view clustering methods often exhibit a tendency to either overly emphasize view-specific information, thus neglecting shared information across views, or alternatively, to place undue focus on shared information, resulting in the dilution of complementary information from individual views. Given the principle that commonality resides within individuality, this paper proposes a staged training approach that comprises two phases: pre-training and fine-tuning. The pre-training phase primarily focuses on learning view-specific information, while the fine-tuning phase aims to doubly enhance commonality across views while maintaining these specific details. Specifically, we learn and extract the specific information of each view through the autoencoder in the pre-training stage. After entering the fine-tuning stage, we first initially enhance the commonality between independent specific views through the transformer layer, and then further strengthen these commonalities through contrastive learning on the semantic labels of each view, so as to obtain more accurate clustering results.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"32 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141551798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FEF-Net: feature enhanced fusion network with crossmodal attention for multimodal humor prediction
Peng Gao, Chuanqi Tao, Donghai Guan
Pub Date: 2024-07-04 | DOI: 10.1007/s00530-024-01402-z
Humor segment prediction in video involves the comprehension and analysis of humor. Traditional humor prediction has been text-based; however, with the evolution of multimedia, the focus has shifted to multimodal approaches, marking a current trend in research. In recent years, determining whether a video is humorous has remained a challenge within the domain of sentiment analysis. Researchers have proposed multiple data fusion methods to address humor prediction and sentiment analysis. In the study of humor and emotion, the text modality assumes a leading role, while the audio and video modalities serve as supplementary data sources for multimodal humor prediction. However, these auxiliary modalities contain significant information irrelevant to the prediction task, resulting in redundancy. Current multimodal fusion models primarily emphasize fusion methods but overlook the high redundancy of auxiliary modalities. Neglecting to reduce this redundancy introduces noise, increases the overall training complexity of models, and diminishes predictive accuracy. Hence, developing a humor prediction method that effectively reduces redundancy in auxiliary modalities is pivotal for advancing multimodal research. In this paper, we propose the Feature Enhanced Fusion Network (FEF-Net), which leverages cross-modal attention to augment features from the auxiliary modalities using knowledge from the textual data. This mechanism generates weights that reflect the redundancy of each corresponding time slice in the auxiliary modality. Transformer encoders are then employed to extract high-level features for each modality, further enhancing the performance of the humor prediction model. Experimental comparisons on the UR-FUNNY and MUStARD multimodal humor datasets reveal a 3.2% improvement in Acc-2 over the best existing model.
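One plausible arrangement of the cross-modal attention described above, sketched with torch.nn.MultiheadAttention: audio time slices attend to the text tokens so that textual knowledge re-weights the auxiliary stream. The dimensions and the query/key assignment are illustrative assumptions rather than FEF-Net's exact design.

```python
import torch
import torch.nn as nn

d_model = 128
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

text = torch.randn(2, 20, d_model)      # 20 text tokens (the guiding modality)
audio = torch.randn(2, 50, d_model)     # 50 audio time slices (the auxiliary modality)

# each audio slice queries the text, yielding text-informed audio features and per-slice weights
enhanced_audio, weights = attn(query=audio, key=text, value=text)
print(enhanced_audio.shape, weights.shape)   # torch.Size([2, 50, 128]) torch.Size([2, 50, 20])
```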
{"title":"FEF-Net: feature enhanced fusion network with crossmodal attention for multimodal humor prediction","authors":"Peng Gao, Chuanqi Tao, Donghai Guan","doi":"10.1007/s00530-024-01402-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01402-z","url":null,"abstract":"<p>Humor segment prediction in video involves the comprehension and analysis of humor. Traditional humor prediction has been text-based; however, with the evolution of multimedia, the focus has shifted to multimodal approaches in humor prediction, marking a current trend in research. In recent years, determining whether a video is humorous has remained a challenge within the domain of sentiment analysis. Researchers have proposed multiple data fusion methods to address humor prediction and sentiment analysis. Within the realm of studying humor and emotions, text modality assumes a leading role, while audio and video modalities serve as supplementary data sources for multimodal humor prediction. However, these auxiliary modalities contain significant irrelevant information unrelated to the prediction task, resulting in redundancy. Current multimodal fusion models primarily emphasize fusion methods but overlook the issue of high redundancy in auxiliary modalities. The lack of research on reducing redundancy in auxiliary modalities introduces noise, thereby increasing the overall training complexity of models and diminishing predictive accuracy. Hence, developing a humor prediction method that effectively reduces redundancy in auxiliary modalities is pivotal for advancing multimodal research. In this paper, we propose the Feature Enhanced Fusion Network (FEF-Net), leveraging cross-modal attention to augment features from auxiliary modalities using knowledge from textual data. This mechanism generates weights to emphasize the redundancy of each corresponding time slice in the auxiliary modality. Further, employing Transformer encoders extracts high-level features for each modality, thereby enhancing the performance of humor prediction models. Experimental comparisons were conducted using the UR-FUNNY and MUStARD multimodal humor prediction models, revealing a 3.2% improvement in ‘Acc-2’ compared to the optimal model.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"44 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141551797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A channel-gained single-model network with variable rate for multispectral image compression in UAV air-to-ground remote sensing
Wei Wang, Daiyin Zhu, Kedi Hu
Pub Date: 2024-07-02 | DOI: 10.1007/s00530-024-01398-6
Unmanned aerial vehicle (UAV) air-to-ground remote sensing has the advantages of long flight duration, real-time image transmission, wide applicability, and low cost. To better preserve the integrity of image features during transmission and storage while improving efficiency, image compression is an essential step. Image compressors based on deep learning frameworks have kept improving with technological development. However, obtaining enough bit-rate points to fit the rate-distortion curve always imposes a severe computational burden, especially for multispectral image compression. This problem arises not only because algorithms are becoming more complex, but also because of repeated training with rate-distortion optimization. In this paper, a channel-gained single-model network with variable rate for multispectral image compression is proposed. First, a channel gain module is introduced to map the channel content of the image to amplitude factors in a vector domain, which scales the representation and allows image representations at different bit rates to be obtained within a single model. Second, after spatial-spectral features are extracted, a plug-and-play dynamic response attention module is applied to distinguish the content correlation of features and dynamically weight important regions without adding extra parameters. In addition, a hyperprior autoencoder is used to make full use of side information for entropy estimation, which contributes to a more accurate entropy model. Experiments show that the proposed method greatly reduces the computational cost while maintaining good compression performance, and surpasses JPEG2000 and other deep-learning-based algorithms in PSNR, MS-SSIM, and MSA.
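A small sketch of the channel-gain idea, in the spirit of gain units for variable-rate coding: a table of learnable per-channel amplitude factors, one row per target rate, rescales the latent before quantization so that one model covers several bit rates. The sizes and the number of rate levels are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class ChannelGain(nn.Module):
    def __init__(self, n_channels=192, n_levels=6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(n_levels, n_channels))      # encoder-side amplitude factors
        self.inv_gain = nn.Parameter(torch.ones(n_levels, n_channels))  # decoder-side factors

    def scale(self, y, level):
        return y * self.gain[level].view(1, -1, 1, 1)

    def unscale(self, y_hat, level):
        return y_hat * self.inv_gain[level].view(1, -1, 1, 1)

latent = torch.randn(1, 192, 32, 32)           # toy analysis-transform output
gains = ChannelGain()
y_scaled = gains.scale(latent, level=3)        # pick one of the trained rate points
y_hat = torch.round(y_scaled)                  # stand-in for quantization / entropy coding
recon_input = gains.unscale(y_hat, level=3)
print(recon_input.shape)                       # torch.Size([1, 192, 32, 32])
```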
{"title":"A channel-gained single-model network with variable rate for multispectral image compression in UAV air-to-ground remote sensing","authors":"Wei Wang, Daiyin Zhu, Kedi Hu","doi":"10.1007/s00530-024-01398-6","DOIUrl":"https://doi.org/10.1007/s00530-024-01398-6","url":null,"abstract":"<p>Unmanned aerial vehicle (UAV) air-to-ground remote sensing technology, has the advantages of long flight duration, real-time image transmission, wide applicability, low cost, and so on. To better preserve the integrity of image features during transmission and storage, and improve efficiency in the meanwhile, image compression is a very important link. Nowadays the image compressor based on deep learning framework has been updating as the technological development. However, in order to obtain enough bit rates to fit the performance curve, there is always a severe computational burden, especially for multispectral image compression. This problem arises not only because the complexity of the algorithm is deepening, but also repeated training with rate-distortion optimization. In this paper, a channel-gained single-model network with variable rate for multispectral image compression is proposed. First, a channel gained module is introduced to map the channel content of the image to vector domain as amplitude factors, which leads to representation scaling, as well as obtaining the image representation of different bit rates in a single model. Second, after extracting spatial-spectral features, a plug-and-play dynamic response attention mechanism module is applied to take good care of distinguishing the content correlation of features and weighting the important area dynamically without adding extra parameters. Besides, a hyperprior autoencoder is used to make full use of edge information for entropy estimation, which contributes to a more accurate entropy model. The experiments prove that the proposed method greatly reduces the computational cost, while maintaining good compression performance and surpasses JPEG2000 and some other algorithms based on deep learning in PSNR, MSSSIM and MSA.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"19 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141506765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}