LCFormer: linear complexity transformer for efficient image super-resolution
Pub Date: 2024-08-01 | DOI: 10.1007/s00530-024-01435-4
Xiang Gao, Sining Wu, Ying Zhou, Fan Wang, Xiaopeng Hu
Recently, Transformer-based methods have made significant breakthroughs in single image super-resolution (SISR), but at considerable computational cost. In this paper, we propose a novel Linear Complexity Transformer (LCFormer) for efficient image super-resolution. Specifically, since vanilla self-attention (SA) has quadratic complexity and often ignores potential correlations among different data samples, External Attention (EA) is introduced into the Transformer to reduce the quadratic complexity to linear while implicitly capturing correlations across the whole dataset. To improve training speed and performance, Root Mean Square Layer Normalization (RMSNorm) is adopted in the Transformer layers. Moreover, an Efficient Gated Depth-wise-conv Feed-forward Network (EGDFN), built from a gating mechanism and depth-wise convolutions, is designed for efficient feature representation in the Transformer. The proposed LCFormer achieves comparable or superior performance to existing Transformer-based methods while dramatically reducing computational complexity and GPU memory consumption. Extensive experiments demonstrate that LCFormer achieves competitive accuracy and visual improvements over other state-of-the-art methods and strikes a favorable trade-off between model performance and computational cost.
{"title":"LCFormer: linear complexity transformer for efficient image super-resolution","authors":"Xiang Gao, Sining Wu, Ying Zhou, Fan Wang, Xiaopeng Hu","doi":"10.1007/s00530-024-01435-4","DOIUrl":"https://doi.org/10.1007/s00530-024-01435-4","url":null,"abstract":"<p>Recently, Transformer-based methods have made significant breakthroughs for single image super-resolution (SISR) but with considerable computation overheads. In this paper, we propose a novel Linear Complexity Transformer (LCFormer) for efficient image super-resolution. Specifically, since the vanilla SA has quadratic complexity and often ignores potential correlations among different data samples, External Attention (EA) is introduced into Transformer to reduce the quadratic complexity to linear and implicitly considers the correlations across the whole dataset. To improve training speed and performance, Root Mean Square Layer Normalization (RMSNorm) is adopted in the Transformer layer. Moreover, an Efficient Gated Depth-wise-conv Feed-forward Network (EGDFN) is designed by the gate mechanism and depth-wise convolutions in Transformer for feature representation with an efficient implementation. The proposed LCFormer achieves comparable or superior performance to existing Transformer-based methods. However, the computation complexity and GPU memory consumption have been dramatically reduced. Extensive experiments demonstrate that LCFormer achieves competitive accuracy and visual improvements against other state-of-the-art methods and reaches a trade-off between model performance and computation costs.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"76 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
View sequence prediction GAN: unsupervised representation learning for 3D shapes by decomposing view content and viewpoint variance
Pub Date: 2024-08-01 | DOI: 10.1007/s00530-024-01431-8
Heyu Zhou, Jiayu Li, Xianzhu Liu, Yingda Lyu, Haipeng Chen, An-An Liu
Unsupervised representation learning for 3D shapes has become a critical problem for large-scale 3D shape management. Recent model-based methods for this task require additional information for training, while popular view-based methods often overlook viewpoint variance in view prediction, leading to uninformative 3D features that limit their practical applications. To address these issues, we propose an unsupervised 3D shape representation learning method called View Sequence Prediction GAN (VSP-GAN), which decomposes view content and viewpoint variance. VSP-GAN takes several adjacent views of a 3D shape as input and outputs the subsequent views. The key idea is to split the multi-view sequence into two perceptible parts, view content and viewpoint variance, and encode them independently with separate encoders. Using this information, a decoder, implemented as a mirrored version of the content encoder, predicts the view sequence in multiple steps. In addition, to improve the quality of the reconstructed views, we propose a novel hierarchical view prediction loss that enhances view realism, semantic consistency, and detail retention. We evaluate the proposed VSP-GAN on two popular 3D CAD datasets, ModelNet10 and ModelNet40, for 3D shape classification and retrieval. The experimental results demonstrate that VSP-GAN learns more discriminative features than state-of-the-art methods.
{"title":"View sequence prediction GAN: unsupervised representation learning for 3D shapes by decomposing view content and viewpoint variance","authors":"Heyu Zhou, Jiayu Li, Xianzhu Liu, Yingda Lyu, Haipeng Chen, An-An Liu","doi":"10.1007/s00530-024-01431-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01431-8","url":null,"abstract":"<p>Unsupervised representation learning for 3D shapes has become a critical problem for large-scale 3D shape management. Recent model-based methods for this task require additional information for training, while popular view-based methods often overlook viewpoint variance in view prediction, leading to uninformative 3D features that limit their practical applications. To address these issues, we propose an unsupervised 3D shape representation learning method called View Sequence Prediction GAN (VSP-GAN), which decomposes view content and viewpoint variance. VSP-GAN takes several adjacent views of a 3D shape as input and outputs the subsequent views. The key idea is to split the multi-view sequence into two available perceptible parts, view content and viewpoint variance, and independently encode them with separate encoders. With the information, we design a decoder implemented by the mirrored architecture of the content encoder to predict the view sequence by multi-steps. Besides, to improve the quality of the reconstructed views, we propose a novel hierarchical view prediction loss to enhance view realism, semantic consistency, and details retainment. We evaluate the proposed VSP-GAN on two popular 3D CAD datasets, ModelNet10 and ModelNet40, for 3D shape classification and retrieval. The experimental results demonstrate that our VSP-GAN can learn more discriminative features than the state-of-the-art methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"45 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tex-Net: texture-based parallel branch cross-attention generalized robust Deepfake detector
Pub Date: 2024-08-01 | DOI: 10.1007/s00530-024-01424-7
Deepak Dagar, Dinesh Kumar Vishwakarma
In recent years, artificial faces generated using Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs) have become more lifelike and difficult for humans to distinguish. Deepfake refers to highly realistic media generated using deep learning technology. Convolutional Neural Networks (CNNs) have demonstrated significant potential in computer vision applications, particularly in identifying fraudulent faces. However, when trained on insufficient data, these networks cannot effectively generalize to unfamiliar datasets, as they are constrained by the inductive biases of their learning process, such as translation equivariance and localization. The attention mechanism of vision transformers has effectively addressed these limitations, leading to their growing popularity in recent years. This work introduces a novel module for extracting global texture information and a model that combines features from a CNN (ResNet-18) and a cross-attention vision transformer. The model computes a global texture from the input using Gram matrices and local binary patterns at each downsampling step of the ResNet-18 architecture. The ResNet-18 main branch and the global texture module operate in parallel before feeding into the cross-attention mechanism of the vision transformer's dual branch. The empirical investigation first demonstrates that counterfeit images typically display more uniform textures that are inconsistent across long distances. The model's cross-forgery performance is demonstrated by experiments conducted on various types of GAN images and FaceForensics++ categories. The results show that the model outperforms many state-of-the-art techniques, achieving an accuracy of up to 85%. Furthermore, multiple tests are performed on different data samples (FF++, DFDCPreview, Celeb-DF) subjected to post-processing, including compression, noise addition, and blurring. These studies validate that the model acquires shared distinguishing characteristics (global texture) that persist across different fake-image distributions, and the outcomes demonstrate that the model is resilient and applicable in many scenarios.
{"title":"Tex-Net: texture-based parallel branch cross-attention generalized robust Deepfake detector","authors":"Deepak Dagar, Dinesh Kumar Vishwakarma","doi":"10.1007/s00530-024-01424-7","DOIUrl":"https://doi.org/10.1007/s00530-024-01424-7","url":null,"abstract":"<p>In recent years, artificial faces generated using Generative Adversarial Networks (GANs) and Variational Auto-encoders (VAEs) have become more lifelike and difficult for humans to distinguish. Deepfake refers to highly realistic and impressive media generated using deep learning technology. Convolutional Neural Networks (CNNs) have demonstrated significant potential in computer vision applications, particularly identifying fraudulent faces. However, if these networks are trained on insufficient data, they cannot effectively apply their knowledge to unfamiliar datasets, as they are susceptible to inherent biases in their learning process, such as translation, equivariance, and localization. The attention mechanism of vision transformers has effectively resolved these limits, leading to their growing popularity in recent years. This work introduces a novel module for extracting global texture information and a model that combines data from CNN (ResNet-18) and cross-attention vision transformers. The model takes in input and generates the global texture by utilizing Gram matrices and local binary patterns at each down sampling step of the ResNet-18 architecture. The ResNet-18 main branch and global texture module operate simultaneously before inputting into the visual transformer’s dual branch’s cross-attention mechanism. Initially, the empirical investigation demonstrates that counterfeit images typically display more uniform textures that are inconsistent across long distances. The model’s performance on the cross-forgery dataset is demonstrated by experiments conducted on various types of GAN images and Faceforensics + + categories. The results show that the model outperforms the scores of many state-of-the-art techniques, achieving an accuracy score of up to 85%. Furthermore, multiple tests are performed on different data samples (FF + +, DFDCPreview, Celeb-Df) that undergo post-processing techniques, including compression, noise addition, and blurring. These studies validate that the model acquires the shared distinguishing characteristics (global texture) that persist across different types of fake picture distributions, and the outcomes of these trials demonstrate that the model is resilient and can be used in many scenarios.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"34 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DCANet: CNN model with dual-path network and improved coordinate attention for JPEG steganalysis
Pub Date: 2024-08-01 | DOI: 10.1007/s00530-024-01433-6
Tong Fu, Liquan Chen, Yuan Gao, Huiyu Fang
Nowadays, convolutional neural networks (CNNs) are applied to JPEG steganalysis and perform better than traditional methods. However, almost all JPEG steganalysis methods use single-path structures, making it challenging to fully exploit the extracted noise residuals. On the other hand, most existing steganalysis detectors lack a focus on the areas where secret information may be hidden. In this research, we present a steganalysis model with a dual-path network and improved coordinate attention to detect adaptive JPEG steganography, consisting mainly of noise extraction, noise aggregation, and classification modules. In particular, a dual-path network architecture that combines the advantages of residual and dense connections is used in the noise extraction module to explore hidden features in depth while preserving the stego signal. Then, an improved coordinate attention mechanism is introduced into the noise aggregation module, which helps the network identify complex texture areas more quickly and extract more valuable features. We verify the validity of the individual components through extensive ablation experiments with the necessary descriptions. Furthermore, we conducted comparative experiments on BOSSBase and BOWS2, and the results demonstrate that the proposed model achieves the best detection performance compared with other state-of-the-art methods.
{"title":"DCANet: CNN model with dual-path network and improved coordinate attention for JPEG steganalysis","authors":"Tong Fu, Liquan Chen, Yuan Gao, Huiyu Fang","doi":"10.1007/s00530-024-01433-6","DOIUrl":"https://doi.org/10.1007/s00530-024-01433-6","url":null,"abstract":"<p>Nowadays, convolutional neural network (CNN) is applied to JPEG steganalysis and performs better than traditional methods. However, almost all JPEG steganalysis methods utilize single-path structures, making it challenging to use the extracted noise residuals fully. On the other hand, most existing steganalysis detectors lack a focus on areas where secret information may be hidden. In this research, we present a steganalysis model with a dual-path network and improved coordinate attention to detect adaptive JPEG steganography, mainly including noise extraction, noise aggregation, and classification module. Especially, a dual-path network architecture simultaneously combining the advantages of both residual and dense connection is utilized to explore the hidden features in-depth while preserving the stego signal in the noise extraction module. Then, an improved coordinate attention mechanism is introduced into the noise aggregation module, which helps the network identify the complex texture area more quickly and extract more valuable features. We have verified the validity of some components through extensive ablation experiments with the necessary descriptions. Furthermore, we conducted comparative experiments on BOSSBase and BOWS2, and the experimental results demonstrate that the proposed model achieves the best detection performance compared with other start-of-the-art methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"44 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Electric vehicle routing optimization under 3D electric energy modeling
Pub Date: 2024-07-31 | DOI: 10.1007/s00530-024-01409-6
Yanfei Zhu, Yonghua Wang, Chunhui Li, Kwang Y. Lee
In logistics transportation, the electric vehicle routing problem (EVRP) has been widely studied in order to reduce vehicle power expenditure, lower transportation costs, and improve service quality. The power expenditure model and the routing algorithm are essential for solving the EVRP. To make the routing schedule more reasonable and closer to reality, this paper employs a three-dimensional power expenditure model to calculate the power expenditure of EVs. In this model, the power expenditure of EVs while driving uphill and downhill is considered, so that routing schedules for logistics transportation in mountainous areas can be solved. This study combines Q-learning with a Re-insertion Genetic Algorithm (Q-RIGA) to design EV routes with low electricity expenditure and reduced transportation costs. The Q-learning algorithm is used to improve route initialization and obtain high-quality initial routes, which are then further optimized by RIGA. Tests on a collection of randomly dispersed customer groups confirm the advantages of the proposed method in terms of convergence speed and power expenditure. The three-dimensional, elevation-aware power expenditure model is also used in simulation experiments on a distribution example of Sanlian Dairy in Guizhou, verifying that the improved model has broader applicability and higher practical value.
{"title":"Electric vehicle routing optimization under 3D electric energy modeling","authors":"Yanfei Zhu, Yonghua Wang, Chunhui Li, Kwang Y. Lee","doi":"10.1007/s00530-024-01409-6","DOIUrl":"https://doi.org/10.1007/s00530-024-01409-6","url":null,"abstract":"<p>In logistics transportation, the electric vehicle routing problem (EVRP) is researched widely in order to save vehicle power expenditure, reduce transportation costs, and improve service quality. The power expenditure model and routing algorithm are essential for resolving EVRP. To align the routing schedule more reasonable and closer to reality, this paper employs a three-dimensional power expenditure model to calculate the power expenditure of EVs. In this model, the power expenditure of the EVs during the process of going up and downhill is considered to solve the routing schedule of logistics transportation in mountainous areas. This study combines Q-learning and the Re-insertion Genetic Algorithm (Q-RIGA) to design EV routes with low electricity expenditure and reduced transportation costs. The Q-learning algorithm is used to improve route initialization and obtain high-quality initial routes, which are further optimized by RIGA. Tested in a collection of randomly dispersed customer groups, the advantages of the proposed method in terms of convergence speed and power expenditure are confirmed. The three-dimensional power expenditure model with consideration of elevation is used to conduct simulation experiments on the distribution example of Sanlian Dairy in Guizhou to verify that the improved model features broader application and higher practical value.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"731 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GloFP-MSF: monocular scene flow estimation with global feature perception
Pub Date: 2024-07-30 | DOI: 10.1007/s00530-024-01418-5
Xuezhi Xiang, Yu Cui, Xi Wang, Mingliang Zhai, Abdulmotaleb El Saddik
Monocular scene flow estimation is a task that allows us to obtain 3D structure and 3D motion from consecutive monocular images. Previous monocular scene flow methods usually focused on directly enhancing image and motion features while neglecting their utilization in the decoder, which is equally crucial for accurate scene flow estimation. Based on cross-covariance attention, we propose a global feature perception module (GFPM) and apply it to the decoder, which enables the decoder to effectively utilize the motion and image features of the current layer as well as the coarse scene flow estimate from the previous layer, thus enhancing the decoder's recovery of 3D motion information. In addition, we propose a parallel architecture of self-attention and convolution (PCSA) for feature extraction, which enhances the global representation ability of the extracted image features. Our proposed method demonstrates remarkable performance on the KITTI 2015 dataset, achieving a relative improvement of 17.6% over the baseline approach. Compared with other recent methods, the proposed model achieves competitive results.
{"title":"GloFP-MSF: monocular scene flow estimation with global feature perception","authors":"Xuezhi Xiang, Yu Cui, Xi Wang, Mingliang Zhai, Abdulmotaleb El Saddik","doi":"10.1007/s00530-024-01418-5","DOIUrl":"https://doi.org/10.1007/s00530-024-01418-5","url":null,"abstract":"<p>Monocular scene flow estimation is a task that allows us to obtain 3D structure and 3D motion from consecutive monocular images. Previous monocular scene flow usually focused on the enhancement of image features and motion features directly while neglecting the utilization of motion features and image features in the decoder, which are equally crucial for accurate scene flow estimation. Based on the cross-covariance attention, we propose a global feature perception module (GFPM) and applie it to the decoder, which enables the decoder to utilize the motion features and image features of the current layer as well as the coarse estimation result of the scene flow of the previous layer effectively, thus enhancing the decoder’s recovery of 3D motion information. In addition, we also propose a parallel architecture of self-attention and convolution (PCSA) for feature extraction, which can enhance the global expression ability of extracted image features. Our proposed method demonstrates remarkable performance on the KITTI 2015 dataset, achieving a relative improvement of 17.6% compared to the baseline approach. Compared to other recent methods, the proposed model achieves competitive results.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"50 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-centered cross-sample fusion network for multimodal sentiment analysis
Pub Date: 2024-07-30 | DOI: 10.1007/s00530-024-01421-w
Qionghao Huang, Jili Chen, Changqin Huang, Xiaodi Huang, Yi Wang
Significant advancements in multimodal sentiment analysis have been achieved through cross-modal attention mechanisms (CMA). However, the importance of modality-specific information for distinguishing similar samples is often overlooked due to the inherent limitations of CMA. To address this issue, we propose a Text-centered Cross-sample Fusion Network (TeCaFN), which employs cross-sample fusion to perceive modality-specific information during modal fusion. Specifically, we develop a cross-sample fusion method that merges modalities from distinct samples. This method preserves detailed modality-specific information through adversarial training combined with a pairwise prediction task. Furthermore, a robust two-stage text-centric contrastive learning mechanism is developed to enhance the stability of cross-sample fusion learning. TeCaFN achieves state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. Moreover, our ablation studies further demonstrate the effectiveness of contrastive learning and adversarial training as components of TeCaFN in improving model performance. The code implementation of this paper is available at https://github.com/TheShy-Dream/MSA-TeCaFN.
{"title":"Text-centered cross-sample fusion network for multimodal sentiment analysis","authors":"Qionghao Huang, Jili Chen, Changqin Huang, Xiaodi Huang, Yi Wang","doi":"10.1007/s00530-024-01421-w","DOIUrl":"https://doi.org/10.1007/s00530-024-01421-w","url":null,"abstract":"<p>Significant advancements in multimodal sentiment analysis tasks have been achieved through cross-modal attention mechanisms (CMA). However, the importance of modality-specific information for distinguishing similar samples is often overlooked due to the inherent limitations of CMA. To address this issue, we propose a <b>T</b>ext-c<b>e</b>ntered <b>C</b>ross-s<b>a</b>mple <b>F</b>usion <b>N</b>etwork (TeCaFN), which employs cross-sample fusion to perceive modality-specific information during modal fusion. Specifically, we develop a cross-sample fusion method that merges modalities from distinct samples. This method maintains detailed modality-specific information through the use of adversarial training combined with a task of pairwise prediction. Furthermore, a robust mechanism using a two-stage text-centric contrastive learning approach is developed to enhance the stability of cross-sample fusion learning. TeCaFN achieves state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. Moreover, our ablation studies further demonstrate the effectiveness of contrastive learning and adversarial training as the components of TeCaFN in improving model performance. The code implementation of this paper is available at https://github.com/TheShy-Dream/MSA-TeCaFN.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"22 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141865991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DMFTNet: dense multimodal fusion transfer network for free-space detection
Pub Date: 2024-07-29 | DOI: 10.1007/s00530-024-01417-6
Jiabao Ma, Wujie Zhou, Meixin Fang, Ting Luo
Free-space detection is an essential task in autonomous driving; it can be formulated as the semantic segmentation of driving scenes. An important line of research in free-space detection is the use of convolutional neural networks to achieve high-accuracy semantic segmentation. In this study, we introduce two fusion modules: the dense exploration module (DEM) and the dual-attention exploration module (DAEM). They efficiently capture diverse fusion information by fully exploring deep and representative information at each network stage. Furthermore, we propose a dense multimodal fusion transfer network (DMFTNet). With the help of the DEM and DAEM, this architecture uses elaborate multimodal deep-fusion exploration modules to extract fused features from RGB and depth features at every stage and then densely transfers them to predict free space. Extensive experiments compared DMFTNet with 11 state-of-the-art approaches on two datasets, and the proposed fusion modules ensured that DMFTNet's free-space detection performance was superior.
{"title":"DMFTNet: dense multimodal fusion transfer network for free-space detection","authors":"Jiabao Ma, Wujie Zhou, Meixin Fang, Ting Luo","doi":"10.1007/s00530-024-01417-6","DOIUrl":"https://doi.org/10.1007/s00530-024-01417-6","url":null,"abstract":"<p>Free-space detection is an essential task in autonomous driving; it can be formulated as the semantic segmentation of driving scenes. An important line of research in free-space detection is the use of convolutional neural networks to achieve high-accuracy semantic segmentation. In this study, we introduce two fusion modules: the dense exploration module (DEM) and the dual-attention exploration module (DAEM). They efficiently capture diverse fusion information by fully exploring deep and representative information at each network stage. Furthermore, we propose a dense multimodal fusion transfer network (DMFTNet). This architecture uses elaborate multimodal deep fusion exploration modules to extract fused features from red–green–blue and depth features at every stage with the help of DEM and DAEM and then densely transfer them to predict the free space. Extensive experiments were conducted comparing DMFTNet and 11 state-of-the-art approaches on two datasets. The proposed fusion module ensured that DMFTNet’s free-space-detection performance was superior.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"1 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141866095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SA-MDRAD: sample-adaptive multi-teacher dynamic rectification adversarial distillation
Pub Date: 2024-07-29 | DOI: 10.1007/s00530-024-01416-7
Shuyi Li, Xiaohan Yang, Guozhen Cheng, Wenyan Liu, Hongchao Hu
Adversarial training of lightweight models suffers from poor effectiveness due to the limited model size and the difficulty of optimizing the loss with hard labels. Adversarial distillation is a potential solution, in which knowledge from large adversarially pre-trained teachers is used to guide the lightweight models' learning. However, adversarially pre-training teachers is computationally expensive because of the iterative gradient steps required on the inputs. Additionally, the reliability of the teachers' guidance diminishes as the lightweight models become more robust. In this paper, we propose an adversarial distillation method called Sample-Adaptive Multi-teacher Dynamic Rectification Adversarial Distillation (SA-MDRAD). First, an adversarial distillation framework that distills logits and features from heterogeneous standard pre-trained teachers is developed to reduce pre-training expenses and improve knowledge diversity. Second, the teachers' knowledge is distilled into the lightweight model after sample-aware dynamic rectification and adaptive fusion based on the teachers' predictions, improving the reliability of the knowledge. Experiments are conducted to evaluate the performance of the proposed method on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The results demonstrate that SA-MDRAD is more effective than existing adversarial distillation methods in enhancing the robustness of lightweight image classification models against various adversarial attacks.
{"title":"SA-MDRAD: sample-adaptive multi-teacher dynamic rectification adversarial distillation","authors":"Shuyi Li, Xiaohan Yang, Guozhen Cheng, Wenyan Liu, Hongchao Hu","doi":"10.1007/s00530-024-01416-7","DOIUrl":"https://doi.org/10.1007/s00530-024-01416-7","url":null,"abstract":"<p>Adversarial training of lightweight models faces poor effectiveness problem due to the limited model size and the difficult optimization of loss with hard labels. Adversarial distillation is a potential solution to the problem, in which the knowledge from large adversarially pre-trained teachers is used to guide the lightweight models’ learning. However, adversarially pre-training teachers is computationally expensive due to the need for iterative gradient steps concerning the inputs. Additionally, the reliability of guidance from teachers diminishes as lightweight models become more robust. In this paper, we propose an adversarial distillation method called Sample-Adaptive Multi-teacher Dynamic Rectification Adversarial Distillation (SA-MDRAD). First, an adversarial distillation framework of distilling logits and features from the heterogeneous standard pre-trained teachers is developed to reduce pre-training expenses and improve knowledge diversity. Second, the knowledge of teachers is distilled into the lightweight model after sample-aware dynamic rectification and adaptive fusion based on teachers’ predictions to improve the reliability of knowledge. Experiments are conducted to evaluate the performance of the proposed method on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The results demonstrate that our SA-MDRAD is more effective than existing adversarial distillation methods in enhancing the robustness of lightweight image classification models against various adversarial attacks.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"65 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141866092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective ensemble based intrusion detection and energy efficient load balancing using sunflower optimization in distributed wireless sensor network
Pub Date: 2024-07-29 | DOI: 10.1007/s00530-024-01388-8
V. S. Prasanth, A. Mary Posonia, A. Parveen Akhther
Wireless sensor networks (WSNs) play a very important role in providing real-time data access for big data and Internet of Things applications. However, the open deployment of WSNs makes them highly susceptible to various malicious attacks, energy constraints, and decentralized governance. For mission-critical applications in WSNs, it is crucial to identify rogue sensor devices and discard the data they sense. The resource-constrained nature of sensor devices prevents the direct application of standard cryptography and authentication techniques in WSNs, so low-latency and energy-efficient methods are needed. An efficient and safe routing system is created in this study. First, outliers are detected among the deployed nodes using a stacking-based ensemble learning approach: a deep neural network (DNN) and a long short-term memory (LSTM) network serve as base classifiers, and a multilayer perceptron (MLP) is used as the meta-classifier. Only normal nodes are considered in the subsequent steps. Then, cluster-head selection and cluster formation are performed based on distance, density, and residual energy. The sunflower optimization algorithm (SOA) is employed for routing to improve energy efficiency and load balancing; superior transmission routing can potentially be obtained by taking the shortest path. The proposed method achieves 95% accuracy in the intrusion detection phase and a packet delivery ratio of 92% for energy-efficient routing. Consequently, the proposed method is the most effective option for load balancing with intrusion detection.
{"title":"Effective ensemble based intrusion detection and energy efficient load balancing using sunflower optimization in distributed wireless sensor network","authors":"V. S. Prasanth, A. Mary Posonia, A. Parveen Akhther","doi":"10.1007/s00530-024-01388-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01388-8","url":null,"abstract":"<p>Wireless sensor networks (WSNs) play a very important role in providing real-time data access for big data and internet of things applications. Despite this, WSNs’ open deployment makes them highly susceptible to various malicious attacks, energy constraints, and decentralized governance. For mission-critical applications in WSNs, it is crucial to identify rogue sensor devices and remove the sensed data they contain. The resource-constrained nature of sensor devices prevents the direct application of standard cryptography and authentication techniques in WSNs. Low latency and energy-efficient methods are therefore needed. An efficient and safe routing system is created in this study. Initially the outliers are detected from deployed nodes using stacking based ensemble learning approach. Deep neural network (DNN) and long short term memory (LSTM) are two different basic classifiers and multilayer perceptron (MLP) is utilized as a Meta classifier in the ensemble method. The normal nodes are considered for further process. Then, distance, density and residual energy based cluster head selection and cluster formations are done. Sunflower optimization algorithm (SOA) is employed in this approach for routing purpose to improve energy efficiency and load balancing. Superior transmission routing can potentially obtained by taking the shortest way. This proposed method achieves 95% accuracy for the intrusion detection phase and 92% is the packet delivery ratio for energy efficient routing. Consequently, the proposed method is the most effective option for load balancing with intrusion detection.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"8 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141866097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}