Generating generalized zero-shot learning based on dual-path feature enhancement
Pub Date : 2024-09-19 DOI: 10.1007/s00530-024-01485-8
Xinyi Chang, Zhen Wang, Wenhao Liu, Limeng Gao, Bingshuai Yan
Generalized zero-shot learning (GZSL) can classify both seen and unseen class samples, which plays a significant role in practical applications such as emerging species recognition and medical image recognition. However, most existing GZSL methods directly use a pre-trained deep model to extract image features. Because the data distribution of the GZSL dataset differs from that of the pre-training dataset, the obtained image features perform poorly: the feature distributions of different classes are similar, which makes them difficult to distinguish. To solve this problem, we propose a dual-path feature enhancement (DPFE) model, which consists of four modules: the feature generation network (FGN), the local fine-grained feature enhancement (LFFE) module, the global coarse-grained feature enhancement (GCFE) module, and the feedback module (FM). The feature generation network synthesizes unseen class image features. We enhance the discriminative power and semantic relevance of the image features from both local and global perspectives. To focus on the image's local discriminative regions, the LFFE module processes the image in blocks and minimizes a semantic cycle-consistency loss to ensure that the region-block features contain key classification semantics. To prevent the information loss caused by image blocking, we design the GCFE module, which enforces consistency between the global image features and the semantic centers, thereby improving the discriminative power of the features. In addition, the feedback module feeds the discriminator network's middle-layer information back to the generator network, so that the synthesized image features are more similar to the real features. Experimental results demonstrate that the proposed DPFE method outperforms state-of-the-art methods on four zero-shot learning benchmark datasets.
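As an illustration of the kind of constraint the LFFE module places on region-block features, the following is a minimal PyTorch-style sketch of a semantic cycle-consistency loss: block features are regressed back to the class-attribute space and penalized for drifting from the ground-truth semantics. The regressor architecture, feature/attribute dimensions, and L1 penalty are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRegressor(nn.Module):
    """Maps a region-block feature back to the class-attribute space."""
    def __init__(self, feat_dim=2048, attr_dim=312):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, attr_dim),
        )

    def forward(self, x):
        return self.net(x)

def semantic_cycle_loss(block_feats, class_attrs, regressor):
    """block_feats: (B, N_blocks, feat_dim) region-block features.
    class_attrs: (B, attr_dim) ground-truth semantic attributes.
    Every block feature must be mappable back to the class semantics,
    so blocks keep class-relevant information."""
    B, N, D = block_feats.shape
    pred = regressor(block_feats.reshape(B * N, D))                    # (B*N, attr_dim)
    target = class_attrs.unsqueeze(1).expand(B, N, -1).reshape(B * N, -1)
    return F.l1_loss(pred, target)

# Usage sketch
regressor = SemanticRegressor()
feats = torch.randn(4, 7, 2048)     # 4 images, 7 region blocks each
attrs = torch.randn(4, 312)         # e.g. CUB-style attribute vectors
loss = semantic_cycle_loss(feats, attrs, regressor)
```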
{"title":"Generating generalized zero-shot learning based on dual-path feature enhancement","authors":"Xinyi Chang, Zhen Wang, Wenhao Liu, Limeng Gao, Bingshuai Yan","doi":"10.1007/s00530-024-01485-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01485-8","url":null,"abstract":"<p>Generalized zero-shot learning (GZSL) can classify both seen and unseen class samples, which plays a significant role in practical applications such as emerging species recognition and medical image recognition. However, most existing GZSL methods directly use the pre-trained deep model to learn the image feature. Due to the data distribution inconsistency between the GZSL dataset and the pre-training dataset, the obtained image features have an inferior performance. The distribution of different class image features is similar, which makes them difficult to distinguish. To solve this problem, we propose a dual-path feature enhancement (DPFE) model, which consists of four modules: the feature generation network (FGN), the local fine-grained feature enhancement (LFFE) module, the global coarse-grained feature enhancement (GCFE) module, and the feedback module (FM). The feature generation network can synthesize unseen class image features. We enhance the image features’ discriminative and semantic relevance from both local and global perspectives. To focus on the image’s local discriminative regions, the LFFE module processes the image in blocks and minimizes the semantic cycle-consistency loss to ensure that the region block features contain key classification semantic information. To prevent information loss caused by image blocking, we design the GCFE module. It ensures the consistency between the global image features and the semantic centers, thereby improving the discriminative power of the features. In addition, the feedback module feeds the discriminator network’s middle layer information back to the generator network. As a result, the synthesized image features are more similar to the real features. Experimental results demonstrate that the proposed DPFE method outperforms the state-of-the-arts on four zero-shot learning benchmark datasets.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"8 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Triple fusion and feature pyramid decoder for RGB-D semantic segmentation
Bin Ge, Xu Zhu, Zihan Tang, Chenxing Xia, Yiming Lu, Zhuang Chen
Pub Date : 2024-09-16 DOI: 10.1007/s00530-024-01459-w
Current RGB-D semantic segmentation networks incorporate depth information as an extra modality and merge RGB and depth features using methods such as equal-weighted concatenation or simple fusion strategies. However, these methods hinder the effective utilization of cross-modal information. To address the failure of existing RGB-D semantic segmentation networks to fully exploit RGB and depth features, we propose an RGB-D semantic segmentation network based on triple fusion and feature pyramid decoding, which achieves bidirectional interaction and fusion of RGB and depth features via the proposed three-stage cross-modal fusion module (TCFM). The TCFM uses cross-modal cross-attention to inject information from each modality into the other, and then fuses the RGB and depth features effectively through a channel-adaptive weighted fusion module. Furthermore, this paper introduces a lightweight feature pyramid decoder to effectively fuse the multi-scale features extracted by the encoder. Experiments on the NYU Depth V2 and SUN RGB-D datasets demonstrate that the proposed cross-modal feature fusion network segments intricate scenes effectively.
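A minimal sketch of cross-modal cross-attention in the spirit of the TCFM is shown below: queries from one modality attend to keys and values from the other, in both directions. The token layout, dimensions, and residual/normalization choices are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalCrossAttention(nn.Module):
    """RGB tokens attend to depth tokens and vice versa, then each
    stream keeps a residual of its own features."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb, depth):
        # rgb, depth: (B, HW, C) flattened spatial tokens
        rgb_mixed, _ = self.rgb_from_depth(query=rgb, key=depth, value=depth)
        depth_mixed, _ = self.depth_from_rgb(query=depth, key=rgb, value=rgb)
        return self.norm_rgb(rgb + rgb_mixed), self.norm_depth(depth + depth_mixed)

# Usage sketch: low-resolution feature maps flattened to token sequences
fuse = CrossModalCrossAttention(dim=256)
rgb_tokens = torch.randn(2, 60 * 80, 256)
depth_tokens = torch.randn(2, 60 * 80, 256)
rgb_out, depth_out = fuse(rgb_tokens, depth_tokens)
```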
{"title":"Triple fusion and feature pyramid decoder for RGB-D semantic segmentation","authors":"Bin Ge, Xu Zhu, Zihan Tang, Chenxing Xia, Yiming Lu, Zhuang Chen","doi":"10.1007/s00530-024-01459-w","DOIUrl":"https://doi.org/10.1007/s00530-024-01459-w","url":null,"abstract":"<p>Current RGB-D semantic segmentation networks incorporate depth information as an extra modality and merge RGB and depth features using methods such as equal-weighted concatenation or simple fusion strategies. However, these methods hinder the effective utilization of cross-modal information. Aiming at the problem that existing RGB-D semantic segmentation networks fail to fully utilize RGB and depth features, we propose an RGB-D semantic segmentation network, based on triple fusion and feature pyramid decoding, which achieves bidirectional interaction and fusion of RGB and depth features via the proposed three-stage cross-modal fusion module (TCFM). The TCFM proposes utilizing cross-modal cross-attention to intermix the data from two modalities into another modality. It fuses the RGB attributes and depth features proficiently, utilizing the channel-adaptive weighted fusion module. Furthermore, this paper introduces a lightweight feature pyramidal decoder network to fuse the multi-scale parts taken out by the encoder effectively. Experiments on NYU Depth V2 and SUN RGB-D datasets demonstrate that the cross-modal feature fusion network proposed in this study efficiently segments intricate scenes.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"38 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automatic lymph node segmentation using deep parallel squeeze & excitation and attention Unet
Pub Date : 2024-09-13 DOI: 10.1007/s00530-024-01465-y
Zhaorui Liu, Hao Chen, Caiyin Tang, Quan Li, Tao Peng
Automatic lymph node (LN) detection and segmentation are critical for cancer staging. In clinical practice, computed tomography (CT) and positron emission tomography (PET) imaging are used to detect abnormal LNs. Yet this remains a difficult task because of the low contrast between LNs and the surrounding soft tissues and the variation in nodal size and shape. We designed a location-guided 3D dual network for LN segmentation. A localization module generates Gaussian masks centered on the LNs within selected regions of interest (ROIs). Our segmentation model incorporates squeeze & excitation (SE) and attention gate (AG) modules into a conventional 3D UNet architecture to boost the utilization of useful features and improve segmentation accuracy. Finally, we provide a simple boundary refinement module to polish the results. We assessed the location-guided LN segmentation network on a clinical head and neck cancer dataset. The location-guided network outperformed a comparable architecture without the Gaussian mask.
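To make the channel-recalibration step concrete, here is a minimal sketch of a squeeze-and-excitation block for 3D feature maps, of the kind that can be inserted into a 3D UNet stage; the reduction ratio and placement are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Squeeze-and-excitation for 3D feature maps: global-pool each
    channel, pass through a small bottleneck MLP, and rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, D, H, W)
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                                # channel-wise recalibration

# Usage sketch inside a 3D UNet encoder stage
se = SEBlock3D(channels=64)
vol_feat = torch.randn(1, 64, 32, 64, 64)           # CT/PET ROI feature volume
out = se(vol_feat)
```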
{"title":"Automatic lymph node segmentation using deep parallel squeeze & excitation and attention Unet","authors":"Zhaorui Liu, Hao Chen, Caiyin Tang, Quan Li, Tao Peng","doi":"10.1007/s00530-024-01465-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01465-y","url":null,"abstract":"<p>Automatic segmentation and lymph node (LN) detection for cancer staging are critical. In clinical practice, computed tomography (CT) and positron emission tomography (PET) imaging detect abnormal LNs. Yet, it is still a difficult task due to the low contrast of LNs and surrounding soft tissues and the variation in nodal size and shape. We designed a location-guided 3D dual network for LN segmentation. A localization module generates Gaussian masks focused on LNs centralized within selected regions of interest (ROI). Our segmentation model incorporated squeeze & excitation (SE) and attention gate (AG) modules into a conventional 3D UNet architecture to boost useful feature utilization and increase usable feature utilization and segmentation accuracy. Lastly, we provide a simple boundary refinement module to polish the outcomes. We assessed the location-guided LN segmentation network’s performance on a clinical dataset with head and neck cancer. The location-guided network outperformed a comparable architecture without the Gaussian mask in terms of performance.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"1 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CAFIN: cross-attention based face image repair network
Pub Date : 2024-09-13 DOI: 10.1007/s00530-024-01466-x
Yaqian Li, Kairan Li, Haibin Li, Wenming Zhang
To address issues such as instability during the training of generative adversarial networks, insufficient clarity in facial structure restoration, inadequate utilization of known information, and lack of attention to color information in images, a Cross-Attention Restoration Network is proposed. First, in the decoding part of the basic first-stage U-Net network, a combination of sub-pixel convolution and upsampling modules is employed to remedy the low-quality restoration associated with a single upsampling step in the image recovery process. Subsequently, the restored output of the first-stage network and the un-restored images are used to compute cross-attention in both the spatial and channel dimensions, recovering the complete restored face image from the known repaired information. In addition, we propose a loss function based on the HSV color space, assigning appropriate weights within the function to significantly improve the color quality of the results. Compared with classical methods, the model performs well in terms of peak signal-to-noise ratio, structural similarity, and FID.
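The sub-pixel branch mentioned above can be sketched as follows: a convolution expands the channel count by the square of the scale factor and PixelShuffle rearranges those channels into a higher-resolution map, optionally summed with a plain upsampling path. The channel sizes and the way the two paths are combined are assumptions for illustration; CAFIN's actual decoder may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubPixelUp(nn.Module):
    """Sub-pixel upsampling: a conv expands channels by scale^2, then
    PixelShuffle rearranges them into a scale-times larger feature map."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * scale ** 2, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

# Usage sketch: combine the sub-pixel output with a bilinear upsampling path
up = SubPixelUp(256, 128)
skip = nn.Conv2d(256, 128, kernel_size=1)            # channel-matching 1x1 conv
x = torch.randn(1, 256, 32, 32)
y = up(x) + F.interpolate(skip(x), scale_factor=2, mode="bilinear", align_corners=False)
print(y.shape)                                       # torch.Size([1, 128, 64, 64])
```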
{"title":"CAFIN: cross-attention based face image repair network","authors":"Yaqian Li, Kairan Li, Haibin Li, Wenming Zhang","doi":"10.1007/s00530-024-01466-x","DOIUrl":"https://doi.org/10.1007/s00530-024-01466-x","url":null,"abstract":"<p>To address issues such as instability during the training of Generative Adversarial Networks, insufficient clarity in facial structure restoration, inadequate utilization of known information, and lack of attention to color information in images, a Cross-Attention Restoration Network is proposed. Initially, in the decoding part of the basic first-stage U-Net network, a combination of sub-pixel convolution and upsampling modules is employed to remedy the low-quality image restoration issue associated with single upsampling in the image recovery process. Subsequently, the restoration part of the first-stage network and the un-restored images are used to compute cross-attention in both spatial and channel dimensions, recovering the complete facial restoration image from the known repaired information. At the same time, we propose a loss function based on HSV space, assigning appropriate weights within the function to significantly improve the color aspects of the image. Compared to classical methods, this model exhibits good performance in terms of peak signal-to-noise ratio, structural similarity, and FID.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"191 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A survey on deep learning-based camouflaged object detection
Pub Date : 2024-09-11 DOI: 10.1007/s00530-024-01478-7
Junmin Zhong, Anzhi Wang, Chunhong Ren, Jintao Wu
Camouflaged object detection (COD) is an emerging visual detection task that aims to identify objects that conceal themselves in their surrounding environment. The high intrinsic similarity between camouflaged objects and their backgrounds makes COD far more challenging than traditional object detection. Recently, COD has attracted increasing research interest in the computer vision community, and numerous deep learning-based methods have been proposed, showing great potential. However, most existing work focuses on analyzing the structure of individual COD models, and few overview works summarize deep learning-based models. To address this gap, we provide a comprehensive analysis and summary of deep learning-based COD models. Specifically, we first classify 48 deep learning-based COD models and analyze their advantages and disadvantages. Second, we introduce four widely used COD datasets and the performance evaluation metrics. Then, we evaluate the performance of existing deep learning-based COD models on these four datasets. Finally, we point out relevant applications and discuss challenges and future research directions for the COD task.
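As a concrete example of the evaluation metrics such surveys report, the snippet below sketches mean absolute error (MAE), one of the metrics commonly used for COD alongside the structure, enhanced-alignment, and weighted F-measures; it is a generic illustration, not code from the surveyed papers.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted camouflage map and the
    binary ground-truth mask, both scaled to [0, 1]."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    return np.abs(pred - gt).mean()

# Usage sketch: evaluate one prediction
pred_map = np.random.rand(352, 352)                      # model output in [0, 1]
gt_mask = (np.random.rand(352, 352) > 0.5).astype(np.float64)
print(f"MAE = {mae(pred_map, gt_mask):.4f}")
```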
{"title":"A survey on deep learning-based camouflaged object detection","authors":"Junmin Zhong, Anzhi Wang, Chunhong Ren, Jintao Wu","doi":"10.1007/s00530-024-01478-7","DOIUrl":"https://doi.org/10.1007/s00530-024-01478-7","url":null,"abstract":"<p>Camouflaged object detection (COD) is an emerging visual detection task that aims to identify objects that conceal themselves in the surrounding environment. The high intrinsic similarities between the camouflaged objects and their backgrounds make COD far more challenging than traditional object detection. Recently, COD has attracted increasing research interest in the computer vision community, and numerous deep learning-based methods have been proposed, showing great potential. However, most of the existing work focuses on analyzing the structure of COD models, with few overview works summarizing deep learning-based models. To address this gap, we provide a comprehensive analysis and summary of deep learning-based COD models. Specifically, we first classify 48 deep learning-based COD models and analyze their advantages and disadvantages. Second, we introduce widely available datasets for COD and performance evaluation metrics. Then, we evaluate the performance of existing deep learning-based COD models on these four datasets. Finally, we indicate relevant applications and discuss challenges and future research directions for the COD task.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"24 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Instance segmentation of faces and mouth-opening degrees based on improved YOLOv8 method
Pub Date : 2024-09-11 DOI: 10.1007/s00530-024-01472-z
Yuhe Fan, Lixun Zhang, Canxing Zheng, Xingyuan Wang, Jinghui Zhu, Lan Wang
Instance segmentation of faces and mouth-opening degrees is an important technology for meal-assisting robotics in food delivery safety. However, because faces vary widely in shape, color, and posture, and the mouth has a small contour area, deforms easily, and is often occluded, real-time and accurate instance segmentation is challenging. In this paper, we propose a novel method for instance segmentation of faces and mouth-opening degrees. Specifically, in the backbone network, deformable convolution is introduced to enhance the ability to capture finer-grained spatial information, and the CloFormer module is introduced to improve the ability to capture high-frequency local and low-frequency global information. In the neck network, classical convolution and C2f modules are replaced by GSConv and VoV-GSCSP aggregation modules, respectively, to reduce the complexity and floating-point operations of the model. Finally, in the localization loss, the CIoU loss is replaced by the WIoU loss to reduce the competitiveness of high-quality anchor boxes and mask the influence of low-quality samples, which in turn improves localization accuracy and generalization ability. The resulting model is abbreviated as DCGW-YOLOv8n-seg. It was compared with the baseline YOLOv8n-seg model and several state-of-the-art instance segmentation models on the datasets. The results show that the DCGW-YOLOv8n-seg model achieves high accuracy, speed, robustness, and generalization ability, and ablation experiments verify the contribution of each improvement to model performance. Finally, the DCGW-YOLOv8n-seg model was applied to an instance segmentation experiment on a meal-assisting robot. The results show that the model achieves a good instance segmentation effect for faces and mouth-opening degrees. The proposed method provides a guiding theoretical basis for meal-assisting robotics in food delivery safety and a useful reference for computer vision and image instance segmentation.
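For reference, the following is a hedged sketch of a GSConv-style block as described in the Slim-Neck literature that this neck modification draws on: a dense convolution produces half of the output channels, a depthwise convolution produces the other half, and the result is concatenated and channel-shuffled. Kernel sizes, activation, and normalization here are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """GSConv-style block: a dense conv produces half of the output
    channels, a depthwise conv produces the other half from it, and the
    two halves are concatenated and channel-shuffled."""
    def __init__(self, c1, c2, k=3, s=1):
        super().__init__()
        c_ = c2 // 2
        self.dense = nn.Sequential(
            nn.Conv2d(c1, c_, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())
        self.depthwise = nn.Sequential(
            nn.Conv2d(c_, c_, 5, 1, 2, groups=c_, bias=False),
            nn.BatchNorm2d(c_), nn.SiLU())

    def forward(self, x):
        y1 = self.dense(x)
        y2 = self.depthwise(y1)
        y = torch.cat((y1, y2), dim=1)              # (B, c2, H, W)
        b, c, h, w = y.shape
        # channel shuffle across the two halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

# Usage sketch
m = GSConv(128, 256, k=3, s=2)
print(m(torch.randn(1, 128, 64, 64)).shape)          # torch.Size([1, 256, 32, 32])
```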
{"title":"Instance segmentation of faces and mouth-opening degrees based on improved YOLOv8 method","authors":"Yuhe Fan, Lixun Zhang, Canxing Zheng, Xingyuan Wang, Jinghui Zhu, Lan Wang","doi":"10.1007/s00530-024-01472-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01472-z","url":null,"abstract":"<p>Instance segmentation of faces and mouth-opening degrees is an important technology for meal-assisting robotics in food delivery safety. However, due to the diversity in in shape, color, and posture of faces and the mouth with small area contour, easy to deform, and occluded, it is challenging to real-time and accurate instance segmentation. In this paper, we proposed a novel method for instance segmentation of faces and mouth-opening degrees. Specifically, in backbone network, deformable convolution was introduced to enhance the ability to capture finer-grained spatial information and the CloFormer module was introduced to improve the ability to capture high-frequency local and low-frequency global information. In neck network, classical convolution and C2f modules are replaced by GSConv and VoV-GSCSP aggregation modules, respectively, to reduce the complexity and floating-point operations of models. Finally, in localization loss, CIOU loss was replaced by WIOU loss to reduce the competitiveness of high-quality anchor frames and mask the influence of low-quality samples, which in turn improves localization accuracy and generalization ability. It is abbreviated as the DCGW-YOLOv8n-seg model. The DCGW-YOLOv8n-seg model was compared with the baseline YOLOv8n-seg model and several state-of-the-art instance segmentation models on datasets, respectively. The results show that the DCGW-YOLOv8n-seg model is characterized by high accuracy, speed, robustness, and generalization ability. The effectiveness of each improvement in improving the model performance was verified by ablation experiments. Finally, the DCGW-YOLOv8n-seg model was applied to the instance segmentation experiment of meal-assisting robotics. The results show that the DCGW-YOLOv8n-seg model can better realize the instance segmentation effect of faces and mouth-opening degrees. The novel method proposed can provide a guiding theoretical basis for meal-assisting robotics in food delivery safety and can provide a reference value for computer vision and image instance segmentation.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implicit neural representation steganography by neuron pruning
Pub Date : 2024-09-10 DOI: 10.1007/s00530-024-01476-9
Weina Dong, Jia Liu, Lifeng Chen, Wenquan Sun, Xiaozhong Pan, Yan Ke
Recently, implicit neural representation (INR) has started to be applied to image steganography. However, the quality of stego and secret images represented by INR is generally low. In this paper, we propose an implicit neural representation steganography method based on neuron pruning. Initially, we randomly deactivate a portion of the neurons to train an INR function that implicitly represents the secret image. Subsequently, we prune the neurons deemed unimportant for representing the secret image in an unstructured manner to obtain a secret function, recording the positions of these neurons as the key. Finally, based on a partial optimization strategy, we reactivate the pruned neurons to construct a stego function that represents the cover image. The recipient only needs the shared key to recover the secret function from the stego function and reconstruct the secret image. Experimental results demonstrate that this method not only allows lossless recovery of the secret image but also performs well in terms of capacity, fidelity, and undetectability. Experiments on images of different resolutions validate that the proposed method offers significant advantages in image quality over existing implicit representation steganography methods.
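A minimal sketch of the unstructured pruning step is given below: weights of the secret INR are ranked by magnitude, the least important ones are zeroed out, and the surviving positions are recorded as the key. The magnitude criterion, keep ratio, and MLP shape are assumptions for illustration; the paper's importance measure and optimization strategy may differ.

```python
import torch
import torch.nn as nn

def prune_secret_function(mlp, keep_ratio=0.5):
    """Unstructured magnitude pruning: keep the largest-magnitude weights
    of the secret INR and record their positions as the key."""
    key = {}
    with torch.no_grad():
        for name, p in mlp.named_parameters():
            if p.dim() < 2:                          # skip biases in this sketch
                continue
            k = max(1, int(p.numel() * keep_ratio))
            thresh = p.abs().flatten().kthvalue(p.numel() - k + 1).values
            mask = p.abs() >= thresh
            p.mul_(mask.to(p.dtype))                 # zero out pruned weights
            key[name] = mask                         # positions shared with receiver
    return key

# Usage sketch: a coordinate MLP representing the secret image (x, y) -> RGB
inr = nn.Sequential(nn.Linear(2, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 3))
key = prune_secret_function(inr, keep_ratio=0.4)
# The freed (pruned) positions can then be re-trained to fit the cover image,
# while the keyed positions preserve the secret function.
```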
{"title":"Implicit neural representation steganography by neuron pruning","authors":"Weina Dong, Jia Liu, Lifeng Chen, Wenquan Sun, Xiaozhong Pan, Yan Ke","doi":"10.1007/s00530-024-01476-9","DOIUrl":"https://doi.org/10.1007/s00530-024-01476-9","url":null,"abstract":"<p>Recently, implicit neural representation (INR) has started to be applied in image steganography. However, the quality of stego and secret images represented by INR is generally low. In this paper, we propose an implicit neural representation steganography method by neuron pruning. Initially, we randomly deactivate a portion of neurons to train an INR function for implicitly representing the secret image. Subsequently, we prune the neurons that are deemed unimportant for representing the secret image in a unstructured manner to obtain a secret function, while marking the positions of neurons as the key. Finally, based on a partial optimization strategy, we reactivate the pruned neurons to construct a stego function for representing the cover image. The recipient only needs the shared key to recover the secret function from the stego function in order to reconstruct the secret image. Experimental results demonstrate that this method not only allows for lossless recovery of the secret image, but also performs well in terms of capacity, fidelity, and undetectability. The experiments conducted on images of different resolutions validate that our proposed method exhibits significant advantages in image quality over existing implicit representation steganography methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"258 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition
Yushan Wu, Zengmin Xu, Mengwei Yuan, Tianchi Tang, Ruxing Meng, Zhongyuan Wang
Pub Date : 2024-09-10 DOI: 10.1007/s00530-024-01463-0
People interact with the world and express feelings through actions, so action recognition has been widely studied, yet it remains under-explored. Traditional self-supervised skeleton-based action recognition focuses on joint-point features and ignores the inherent semantic information of body structures at different scales. To address this problem, we propose a multi-scale Motion Contrastive Learning of Visual Representations (MsMCLR) model. The model uses the Multi-scale Motion Attention (MsM Attention) module to divide the skeletal features into three scale levels and extracts cross-frame and cross-node motion features from them. To obtain more motion patterns, a combination of strong data augmentations is used in the proposed model, which encourages the model to exploit more motion features. However, the feature sequences generated by strong data augmentation make it difficult to maintain the identity of the original sequence. Hence, we introduce a dual distributional divergence minimization method and propose a multi-scale motion loss function, which uses the embedding distribution of the ordinary augmentation branch to supervise the loss computation of the strong augmentation branch. Finally, the proposed method is evaluated on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The accuracy of our method is 1.4–3.0% higher than that of the leading models.
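One common way to realize this kind of distributional supervision is to compare the two branches' similarity distributions over a shared set of anchor embeddings with a KL divergence, as in the sketch below; the memory bank, temperature, and exact divergence are assumptions for illustration rather than the paper's multi-scale motion loss.

```python
import torch
import torch.nn.functional as F

def distribution_supervision_loss(z_strong, z_weak, bank, tau=0.07):
    """The ordinary-augmentation branch's similarity distribution over a
    memory bank supervises the strong-augmentation branch via KL divergence."""
    z_strong = F.normalize(z_strong, dim=1)
    z_weak = F.normalize(z_weak, dim=1).detach()     # no gradient: acts as teacher
    bank = F.normalize(bank, dim=1)

    logits_strong = z_strong @ bank.t() / tau        # (B, K)
    logits_weak = z_weak @ bank.t() / tau

    p_weak = F.softmax(logits_weak, dim=1)
    log_p_strong = F.log_softmax(logits_strong, dim=1)
    # KL(p_weak || p_strong), averaged over the batch
    return F.kl_div(log_p_strong, p_weak, reduction="batchmean")

# Usage sketch
z_s, z_w = torch.randn(32, 128), torch.randn(32, 128)   # two branch embeddings
memory_bank = torch.randn(4096, 128)                    # anchor/negative embeddings
loss = distribution_supervision_loss(z_s, z_w, memory_bank)
```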
{"title":"Multi-scale motion contrastive learning for self-supervised skeleton-based action recognition","authors":"Yushan Wu, Zengmin Xu, Mengwei Yuan, Tianchi Tang, Ruxing Meng, Zhongyuan Wang","doi":"10.1007/s00530-024-01463-0","DOIUrl":"https://doi.org/10.1007/s00530-024-01463-0","url":null,"abstract":"<p>People process things and express feelings through actions, action recognition has been able to be widely studied, yet under-explored. Traditional self-supervised skeleton-based action recognition focus on joint point features, ignoring the inherent semantic information of body structures at different scales. To address this problem, we propose a multi-scale Motion Contrastive Learning of Visual Representations (MsMCLR) model. The model utilizes the Multi-scale Motion Attention (MsM Attention) module to divide the skeletal features into three scale levels, extracting cross-frame and cross-node motion features from them. To obtain more motion patterns, a combination of strong data augmentation is used in the proposed model, which motivates the model to utilize more motion features. However, the feature sequences generated by strong data augmentation make it difficult to maintain identity of the original sequence. Hence, we introduce a dual distributional divergence minimization method, proposing a multi-scale motion loss function. It utilizes the embedding distribution of the ordinary augmentation branch to supervise the loss computation of the strong augmentation branch. Finally, the proposed method is evaluated on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets. The accuracy of our method is 1.4–3.0% higher than the frontier models.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"83 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C2IENet: Multi-branch medical image fusion based on contrastive constraint features and information exchange
Pub Date : 2024-09-09 DOI: 10.1007/s00530-024-01473-y
Jing Di, Chan Liang, Li Ren, Wenqing Guo, Jizhao Liu, Jing Lian
In the field of medical image fusion, traditional approaches often fail to differentiate between the unique characteristics of each raw image, leading to fused images with compromised texture and structural clarity. Addressing this, we introduce an advanced multi-branch fusion method characterized by contrast-enhanced features and interactive information exchange. This method integrates a multi-scale residual module and a gradient-dense module within a private branch to precisely extract and enrich the texture details of each raw image. In parallel, a common feature extraction branch, equipped with an information interaction module, processes paired raw images to synergistically capture complementary and shared functional information across modalities. Additionally, we implement an attention mechanism tailored to both the private and common branches to enhance global feature extraction, thereby significantly improving the contrast and contour definition of the fused image. A novel correlation consistency loss function further refines the fusion process by optimizing the information sharing between modalities, promoting the correlation among basic cross-modal features while minimizing the correlation of high-frequency details across different modalities. Objective evaluations demonstrate substantial improvements in the EN, MI, QMI, SSIM, AG, SF, and Q^{AB/F} indices, with average increases of 23.67%, 12.35%, 4.22%, 20.81%, 8.96%, 6.38%, and 25.36%, respectively. These results underscore our method's superiority in achieving enhanced texture detail and contrast in fused images compared with conventional algorithms, as validated by both subjective assessments and objective performance metrics.
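As a sketch of what a correlation consistency term can look like, the snippet below computes a Pearson correlation between cross-modal feature pairs and rewards correlation for the shared (base) features while penalizing it for the high-frequency (detail) features. The base/detail decomposition, the loss weighting, and the exact formulation are assumptions, not C2IENet's published loss.

```python
import torch

def correlation(a, b, eps=1e-8):
    """Pearson correlation between two batches of flattened feature maps."""
    a = a.flatten(1) - a.flatten(1).mean(dim=1, keepdim=True)
    b = b.flatten(1) - b.flatten(1).mean(dim=1, keepdim=True)
    num = (a * b).sum(dim=1)
    den = a.norm(dim=1) * b.norm(dim=1) + eps
    return num / den                                 # (B,)

def correlation_consistency_loss(base_a, base_b, detail_a, detail_b):
    """Encourage shared (base) cross-modal features to be correlated while
    keeping high-frequency (detail) features decorrelated across modalities."""
    cc_base = correlation(base_a, base_b).mean()
    cc_detail = correlation(detail_a, detail_b).abs().mean()
    return (1.0 - cc_base) + cc_detail

# Usage sketch with CT/MRI-style feature maps
feat = lambda: torch.randn(2, 64, 56, 56)
loss = correlation_consistency_loss(feat(), feat(), feat(), feat())
```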
{"title":"C2IENet: Multi-branch medical image fusion based on contrastive constraint features and information exchange","authors":"Jing Di, Chan Liang, Li Ren, Wenqing Guo, Jizhao Liu, Jing Lian","doi":"10.1007/s00530-024-01473-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01473-y","url":null,"abstract":"<p>In the field of medical image fusion, traditional approaches often fail to differentiate between the unique characteristics of each raw image, leading to fused images with compromised texture and structural clarity. Addressing this, we introduce an advanced multi-branch fusion method characterized by contrast-enhanced features and interactive information exchange. This method integrates a multi-scale residual module and a gradient-dense module within a private branch to precisely extract and enrich texture details from individual raw images. In parallel, a common feature extraction branch, equipped with an information interaction module, processes paired raw images to synergistically capture complementary and shared functional information across modalities. Additionally, we implement a sophisticated attention mechanism tailored for both the private and public branches to enhance global feature extraction, thereby significantly improving the contrast and contour definition of the fused image. A novel correlation consistency loss function further refines the fusion process by optimizing the information sharing between modalities, promoting the correlation among basic cross-modal features while minimizing the correlation of high-frequency details across different modalities. Objective evaluations demonstrate substantial improvements in indices such as EN, MI, QMI, SSIM, AG, SF, and <span>(text {Q}^{text {AB/F}})</span>, with average increases of 23.67%, 12.35%, 4.22%, 20.81%, 8.96%, 6.38%, and 25.36%, respectively. These results underscore our method’s superiority in achieving enhanced texture detail and contrast in fused images compared to conventional algorithms, as validated by both subjective assessments and objective performance metrics.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"19 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLR-MVSNet: a lightweight network for low-texture scene reconstruction
Lina Wang, Jiangfeng She, Qiang Zhao, Xiang Wen, Qifeng Wan, Shuangpin Wu
Pub Date : 2024-09-09 DOI: 10.1007/s00530-024-01464-z
In recent years, learning-based MVS methods have achieved excellent performance compared with traditional methods. However, these methods still have notable shortcomings, such as the low efficiency of traditional convolutional networks and overly simple feature fusion, which lead to incomplete reconstruction. In this research, we propose a lightweight network for low-texture scene reconstruction (LLR-MVSNet). To improve accuracy and efficiency, the network comprises a multi-scale feature extraction module and a weighted feature fusion module. The multi-scale feature extraction module uses depthwise-separable and point-wise convolutions in place of traditional convolution, which reduces the network parameters and improves model efficiency. To improve fusion accuracy, the weighted feature fusion module selectively emphasizes informative features and suppresses useless information. With rapid computational speed and high performance, our method surpasses state-of-the-art benchmarks and performs well on the DTU and Tanks & Temples datasets. The code of our method will be made available at https://github.com/wln19/LLR-MVSNet.
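A generic sketch of the depthwise-separable plus point-wise replacement is shown below, along with the parameter count it saves relative to a standard convolution; channel sizes and normalization are illustrative assumptions, not LLR-MVSNet's actual layers.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (per-channel spatial filtering) followed by a
    point-wise 1x1 conv (channel mixing): far fewer parameters than a
    standard k x k convolution."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, s, k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Parameter comparison for 64 -> 128 channels with a 3x3 kernel:
#   standard conv:          64 * 128 * 3 * 3 = 73,728 weights
#   depthwise + pointwise:  64 * 3 * 3 + 64 * 128 = 8,768 weights
block = DepthwiseSeparableConv(64, 128)
print(block(torch.randn(1, 64, 128, 160)).shape)     # torch.Size([1, 128, 128, 160])
```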
{"title":"LLR-MVSNet: a lightweight network for low-texture scene reconstruction","authors":"Lina Wang, Jiangfeng She, Qiang Zhao, Xiang Wen, Qifeng Wan, Shuangpin Wu","doi":"10.1007/s00530-024-01464-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01464-z","url":null,"abstract":"<p>In recent years, learning-based MVS methods have achieved excellent performance compared with traditional methods. However, these methods still have notable shortcomings, such as the low efficiency of traditional convolutional networks and simple feature fusion, which lead to incomplete reconstruction. In this research, we propose a lightweight network for low-texture scene reconstruction (LLR-MVSNet). To improve accuracy and efficiency, a lightweight network is proposed, including a multi-scale feature extraction module and a weighted feature fusion module. The multi-scale feature extraction module uses depth-separable convolution and point-wise convolution to replace traditional convolution, which can reduce network parameters and improve the model efficiency. In order to improve the fusion accuracy, a weighted feature fusion module is proposed, which can selectively emphasize features, suppress useless information and improve the fusion accuracy. With rapid computational speed and high performance, our method surpasses the state-of-the-art benchmarks and performs well on the DTU and the Tanks & Temples datasets. The code of our method will be made available at https://github.com/wln19/LLR-MVSNet.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"114 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}