Noise removal is an essential preprocessing step for many computer vision tasks. Many denoising models based on deep neural networks perform well when the noise follows a known distribution (e.g., additive white Gaussian noise). However, eliminating real noise remains very challenging, since real-world noise rarely follows a single distribution and may vary spatially. In this paper, we present a novel dual convolutional neural network (CNN) with attention for blind image denoising, named DCANet. To the best of our knowledge, DCANet is the first work to integrate both a dual CNN and an attention mechanism for image denoising. DCANet is composed of a noise estimation network, a spatial and channel attention module (SCAM), and a dual CNN. The noise estimation network estimates the spatial distribution and the level of the noise in an image. The noisy image and its estimated noise are combined as the input of the SCAM, and a dual CNN with two different branches learns complementary features to obtain the denoised image. Experimental results verify that the proposed DCANet suppresses both synthetic and real noise effectively. The code of DCANet is available at https://github.com/WenCongWu/DCANet.
{"title":"Dual convolutional neural network with attention for image blind denoising","authors":"Wencong Wu, Guannan Lv, Yingying Duan, Peng Liang, Yungang Zhang, Yuelong Xia","doi":"10.1007/s00530-024-01469-8","DOIUrl":"https://doi.org/10.1007/s00530-024-01469-8","url":null,"abstract":"<p>Noise removal of images is an essential preprocessing procedure for many computer vision tasks. Currently, many denoising models based on deep neural networks can perform well in removing the noise with known distributions (i.e. the additive Gaussian white noise). However eliminating real noise is still a very challenging task, since real-world noise often does not simply follow one single type of distribution, and the noise may spatially vary. In this paper, we present a novel dual convolutional neural network (CNN) with attention for image blind denoising, named as the DCANet. To the best of our knowledge, the proposed DCANet is the first work that integrates both the dual CNN and attention mechanism for image denoising. The DCANet is composed of a noise estimation network, a spatial and channel attention module (SCAM), and a dual CNN. The noise estimation network is utilized to estimate the spatial distribution and the noise level in an image. The noisy image and its estimated noise are combined as the input of the SCAM, and a dual CNN contains two different branches is designed to learn the complementary features to obtain the denoised image. The experimental results have verified that the proposed DCANet can suppress both synthetic and real noise effectively. The code of DCANet is available at https://github.com/WenCongWu/DCANet.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"13 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06. DOI: 10.1007/s00530-024-01462-1
Mawei Wu, Aiwen Jiang, Hourong Chen, Jihua Ye
Image dehazing aims to restore high-fidelity clear images from hazy ones and has wide applications in intelligent image analysis systems in computer vision. Many prior-based and learning-based methods have made significant progress in this field. However, the domain gap between synthetic and real hazy images still degrades a model's generalization in real-world scenarios. In this paper, we propose an effective physical-prior-guided single image dehazing network trained via unpaired contrastive learning (PDUNet). The learning process of PDUNet consists of a pre-training stage on synthetic data and a fine-tuning stage on real data. Mixed-prior modules, controllable zero-convolution modules, and unpaired contrastive regularization with hybrid transmission maps are proposed to fully exploit the complementary advantages of prior-based and learning-based strategies. Specifically, the mixed-prior module provides precise haze distributions. Zero-convolution modules serve as controllable bypasses that supplement the pre-trained model with additional real-world haze information while mitigating the risk of catastrophic forgetting during fine-tuning. Hybrid prior-generated transmission maps are employed for unpaired contrastive regularization. By leveraging physical prior statistics and a large amount of unlabeled real data, the proposed PDUNet exhibits excellent generalization and adaptability in real-world hazy scenarios. Extensive experiments on public datasets demonstrate that the proposed method improves PSNR, NIQE, and BRISQUE values by an average of 0.33, 0.69, and 2.3, respectively, with model efficiency comparable to the state of the art. Related code and model parameters will be publicly available on GitHub at https://github.com/Jotra9872/PDU-Net.
{"title":"Physical-prior-guided single image dehazing network via unpaired contrastive learning","authors":"Mawei Wu, Aiwen Jiang, Hourong Chen, Jihua Ye","doi":"10.1007/s00530-024-01462-1","DOIUrl":"https://doi.org/10.1007/s00530-024-01462-1","url":null,"abstract":"<p>Image dehazing aims to restore high fidelity clear images from hazy ones. It has wide applications on many intelligent image analysis systems in computer vision area. Many prior-based and learning-based methods have already made significant progress in this field. However, the domain gap between synthetic and real hazy images still negatively impacts model’s generalization performance in real-world scenarios. In this paper, we have proposed an effective physical-prior-guided single image dehazing network via unpaired contrastive learning (PDUNet). The learning process of PDUNet consists of pre-training stage on synthetic data and fine-tuning stage on real data. Mixed-prior modules, controllable zero-convolution modules, and unpaired contrastive regularization with hybrid transmission maps have been proposed to fully utilize complementary advantages of both prior-based and learning-based strategies. Specifically, mixed-prior module provides precise haze distributions. Zero-convolution modules serving as controllable bypass supplement pre-trained model with additional real-world haze information, as well as mitigate the risk of catastrophic forgetting during fine-tuning. Hybrid prior-generated transmission maps are employed for unpaired contrastive regularization. Through leveraging physical prior statistics and vast of unlabel real-data, the proposed PDUNet exhibits excellent generalization and adaptability on handling real-world hazy scenarios. Extensive experiments on public dataset have demonstrated that the proposed method improves PSNR,NIQE and BRISQUE values by an average of 0.33, 0.69 and 2.3, respectively, with comparable model efficiency compared to SOTA. Related codes and model parameters will be publicly available on Github https://github.com/Jotra9872/PDU-Net.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"44 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06. DOI: 10.1007/s00530-024-01470-1
Yao Zhang, Zijie Song, Zhenzhen Hu
Text-based image captioning is an important task that aims to generate descriptions by reading and reasoning about the scene text in images. A text-based image contains both textual and visual information, which is difficult to describe comprehensively. Recent works fail to adequately model the relationship between features of different modalities and their fine-grained alignment. Due to the multimodal nature of scene text, the representations of OCR tokens usually come from separate visual and textual encoders, leading to heterogeneous features. Although many works have focused on fusing features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence within scene text has not been fully exploited. In this paper, we propose a Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and apply it to text-based image captioning. The HAM is designed to capture the coherence between different modalities of OCR tokens and to provide context-aware scene text representations for generating accurate image captions. To the best of our knowledge, we are the first to apply a heterogeneous attention mechanism to explore the coherence in OCR tokens for text-based image captioning. By calculating the heterogeneous similarity, we interactively enhance the alignment between visual and textual information in OCR. We conduct experiments on the TextCaps dataset. Under the same setting, the results show that our model achieves competitive performance compared with advanced methods, and an ablation study demonstrates that our framework improves the original model on all metrics.
{"title":"Exploring coherence from heterogeneous representations for OCR image captioning","authors":"Yao Zhang, Zijie Song, Zhenzhen Hu","doi":"10.1007/s00530-024-01470-1","DOIUrl":"https://doi.org/10.1007/s00530-024-01470-1","url":null,"abstract":"<p>Text-based image captioning is an important task, aiming to generate descriptions based on reading and reasoning the scene texts in images. Text-based image contains both textual and visual information, which is difficult to be described comprehensively. Recent works fail to adequately model the relationship between features of different modalities and fine-grained alignment. Due to the multimodal characteristics of scene texts, the representations of text usually come from multiple encoders of visual and textual, leading to heterogeneous features. Though lots of works have paid attention to fuse features from different sources, they ignore the direct correlation between heterogeneous features, and the coherence in scene text has not been fully exploited. In this paper, we propose Heterogeneous Attention Module (HAM) to enhance the cross-modal representations of OCR tokens and devote it to text-based image captioning. The HAM is designed to capture the coherence between different modalities of OCR tokens and provide context-aware scene text representations to generate accurate image captions. To the best of our knowledge, we are the first to apply the heterogeneous attention mechanism to explore the coherence in OCR tokens for text-based image captioning. By calculating the heterogeneous similarity, we interactively enhance the alignment between visual and textual information in OCR. We conduct the experiments on the TextCaps dataset. Under the same setting, the results show that our model achieves competitive performances compared with the advanced methods and ablation study demonstrates that our framework enhances the original model in all metrics.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"49 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30. DOI: 10.1007/s00530-024-01452-3
Chang Su, Yong Han, Suihao Lu, Dongsheng Jiang
In Industry 4.0 and advanced manufacturing, producing high-precision, complex products such as aero-engine blades involves sophisticated processes. Digital twin technology enables the creation of high-precision, real-time 3D models, optimizing manufacturing processes and improving product qualification rates. Establishing geometric models is crucial for effective digital twins. Traditional methods often fail to meet precision and efficiency demands. This paper proposes a fitting method based on an improved sparrow search algorithm (SSA) for high-precision curve fitting with minimal control points. This enhances modeling precision and efficiency, creating models suitable for digital twin environments and improving machining qualification rates. The SSA’s position update function is enhanced, and an internal node vector update range prevents premature convergence and improves global search capabilities. Through automatic iterations, optimal control points are calculated using the least squares method. Fitness values, based on local and global errors, are iteratively calculated to achieve target accuracy. Validation with aero-engine blade data showed fitting accuracies of 1e−3 mm and 1e−5 mm. Efficiency in searching for minimal control points improved by 34.7%–49.6% compared to traditional methods. This SSA-based fitting method significantly advances geometric modeling precision and efficiency, addressing modern manufacturing challenges with high-quality, real-time production capabilities.
{"title":"Adaptive B-spline curve fitting with minimal control points using an improved sparrow search algorithm for geometric modeling of aero-engine blades","authors":"Chang Su, Yong Han, Suihao Lu, Dongsheng Jiang","doi":"10.1007/s00530-024-01452-3","DOIUrl":"https://doi.org/10.1007/s00530-024-01452-3","url":null,"abstract":"<p>In Industry 4.0 and advanced manufacturing, producing high-precision, complex products such as aero-engine blades involves sophisticated processes. Digital twin technology enables the creation of high-precision, real-time 3D models, optimizing manufacturing processes and improving product qualification rates. Establishing geometric models is crucial for effective digital twins. Traditional methods often fail to meet precision and efficiency demands. This paper proposes a fitting method based on an improved sparrow search algorithm (SSA) for high-precision curve fitting with minimal control points. This enhances modeling precision and efficiency, creating models suitable for digital twin environments and improving machining qualification rates. The SSA’s position update function is enhanced, and an internal node vector update range prevents premature convergence and improves global search capabilities. Through automatic iterations, optimal control points are calculated using the least squares method. Fitness values, based on local and global errors, are iteratively calculated to achieve target accuracy. Validation with aero-engine blade data showed fitting accuracies of 1e−3 mm and 1e−5 mm. Efficiency in searching for minimal control points improved by 34.7%–49.6% compared to traditional methods. This SSA-based fitting method significantly advances geometric modeling precision and efficiency, addressing modern manufacturing challenges with high-quality, real-time production capabilities.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"11 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30. DOI: 10.1007/s00530-024-01440-7
Maryam Karimi, Mansour Nejati
Taking high-quality images at night is challenging for many applications, so assessing the quality of night-time images (NTIs) is a significant area of research. Since no reference image exists for such images, night-time image quality assessment (NTQA) must be performed blindly. Although blind quality assessment of natural images has gained significant popularity over the past decade, NTIs typically exhibit complex distortions such as contrast loss, chroma noise, color desaturation, and detail blur, which have been less investigated. In this paper, a blind night-time image quality evaluation model is proposed that generates innovative quality-aware local feature maps, including detail exposedness, color saturation, sharpness, contrast, and naturalness. These maps are then compressed and converted into global feature representations using histograms. The feature histograms are used to learn a Gaussian process regression (GPR) quality prediction model. The suggested BIQA approach for night images undergoes a comprehensive evaluation on a standard night image database. The experimental results demonstrate superior prediction performance compared to other advanced BIQA methods, while the method remains simple to implement and fast to execute; hence, it is readily applicable in real-time scenarios.
{"title":"HNQA: histogram-based descriptors for fast night-time image quality assessment","authors":"Maryam Karimi, Mansour Nejati","doi":"10.1007/s00530-024-01440-7","DOIUrl":"https://doi.org/10.1007/s00530-024-01440-7","url":null,"abstract":"<p>Taking high quality images at night is a challenging issue for many applications. Therefore, assessing the quality of night-time images (NTIs) is a significant area of research. Since there is no reference image for such images, night-time image quality assessment (NTQA) must be performed blindly. Although the field of blind quality assessment of natural images has gained significant popularity over the past decade, the quality assessment of NTIs usually confront complex distortions such as contrast loss, chroma noise, color desaturation, and detail blur, that have been less investigated. In this paper, a blind night-time image quality evaluation model is proposed by generating innovative quality-aware local feature maps, including detail exposedness, color saturation, sharpness, contrast, and naturalness. In the next step, these maps are compressed and converted into global feature representations using histograms. These feature histograms are used to learn a Gaussian process regression (GPR) quality prediction model. The suggested BIQA approach for night images undergoes a comprehensive evaluation on a standard night image database. The results of the experiments demonstrate the superior prediction performance of the proposed BIQA method for night images compared to other advanced BIQA methods despite its simplicity of implementation and execution speed. Hence, it is readily applicable in real-time scenarios.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"2 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30. DOI: 10.1007/s00530-024-01454-1
Junfeng Wang, Jiayue Yang, Lidun
The presence of a large number of robot accounts on social media has led to negative social impacts. In most cases, the distribution of robot accounts and real human accounts is imbalanced, so the minority classes are insufficiently represented and perform poorly. Graph neural networks (GNNs) can effectively exploit user interactions and are widely used to process graph-structured data, achieving good performance in robot detection. However, previous GNN-based robot detection methods have mostly considered the impact of class imbalance only in terms of sample quantity. In graph-structured data, the imbalance caused by differences in the position and structure of labeled nodes also biases GNN results toward the larger categories, and because the unique connectivity issues of the graph structure are not taken into account, node classification performance is not ideal. To address these shortcomings, this paper proposes a class-imbalanced node classification algorithm based on minority weighting and an abnormal-connectivity margin loss. It extends traditional imbalanced-classification ideas from machine learning to graph-structured data and jointly handles quantity imbalance and abnormal graph connectivity, improving the GNN's perception of connection anomalies. In the node feature aggregation stage, weighted aggregation is applied to minority classes. In the oversampling stage, the SMOTE algorithm is used to process the imbalanced data while considering both node representations and topology. An edge generator is trained simultaneously to model relational information and, combined with the abnormal-connectivity margin loss, enhances the model's learning of connectivity information, greatly improving the quality of the edge generator. Finally, we evaluate the method on a publicly available dataset, and the experimental results show that it achieves good results in classifying imbalanced nodes.
{"title":"Wacml: based on graph neural network for imbalanced node classification algorithm","authors":"Junfeng Wang, Jiayue Yang, Lidun","doi":"10.1007/s00530-024-01454-1","DOIUrl":"https://doi.org/10.1007/s00530-024-01454-1","url":null,"abstract":"<p>The presence of a large number of robot accounts on social media has led to negative social impacts. In most cases, the distribution of robot accounts and real human accounts is imbalanced, resulting in insufficient representativeness and poor performance of a few types of samples. Graph neural networks can effectively utilize user interaction and are widely used to process graph structure data, achieving good performance in robot detection. However, previous robot detection methods based on GNN mostly considered the impact of class imbalance. However, in graph-structured data, the imbalance caused by differences in the position and structure of labeled nodes makes the processing results of GNN prone to bias toward larger categories. Due to the lack of consideration for the unique connectivity issues of the graph structure, the classification performance of nodes is not ideal. Therefore, in response to the shortcomings of existing schemes, this paper proposes a class imbalanced node classification algorithm based on minority weighting and abnormal connectivity margin loss, which extends the traditional imbalanced classification idea in the field of machine learning to graph-structured data and jointly handles the problem of quantity imbalance and graph-structured abnormal connectivity to improve GNN’s perception of connection anomalies. In the node feature aggregation stage, weighted aggregation is applied to minority classes. In the oversampling stage, the SMOTE algorithm is used to process imbalanced data, while considering node representation and topology structure. Simultaneously training an edge generator to model relationship information, combined with abnormal connectivity margin loss, to enhance the model’s learning of connectivity information, greatly improving the quality of the edge generator. Finally, we evaluated a publicly available dataset, and the experimental results showed that it achieved good results in classifying imbalanced nodes.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"128 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30. DOI: 10.1007/s00530-024-01456-z
Zhu Wenyi, Ding Xiangling, Zhang Chao, Deng Yingqian, Zhao Yulin
Video matting is a technique used to replace foreground objects in video frames by predicting their alpha matte. Originally developed for film special effects, advertisements, and live streaming, video matting can also be exploited for malicious tampering, leaving imperceptible traces. This highlights the need for effective forensic techniques to detect such tampering. Current research in video matting forensics is limited, largely focusing on frame-by-frame analysis, which fails to account for the temporal characteristics of videos and thus falls short in accurately localizing tampered regions. In this paper, we address this gap by leveraging the entire video sequence to improve tampering detection. We propose a two-branch network that integrates contour information of tampered objects into the forgery localization process, enhancing the extraction of tampering traces and contour features. Additionally, we introduce a tamper contour detection module and a feature enhancement module to refine tampered region identification. Extensive experiments conducted on both overt and synthetic tampering datasets demonstrate that our method effectively locates tampered regions, outperforming existing video forensics techniques.
{"title":"Contour-assistance-based video matting localization","authors":"Zhu Wenyi, Ding Xiangling, Zhang Chao, Deng Yingqian, Zhao Yulin","doi":"10.1007/s00530-024-01456-z","DOIUrl":"https://doi.org/10.1007/s00530-024-01456-z","url":null,"abstract":"<p>Video matting is a technique used to replace foreground objects in video frames by predicting their alpha matte. Originally developed for film special effects, advertisements, and live streaming, video matting can also be exploited for malicious tampering, leaving imperceptible traces. This highlights the need for effective forensic techniques to detect such tampering. Current research in video matting forensics is limited, largely focusing on frame-by-frame analysis, which fails to account for the temporal characteristics of videos and thus falls short in accurately localizing tampered regions. In this paper, we address this gap by leveraging the entire video sequence to improve tampering detection. We propose a two-branch network that integrates contour information of tampered objects into the forgery localization process, enhancing the extraction of tampering traces and contour features. Additionally, we introduce a tamper contour detection module and a feature enhancement module to refine tampered region identification. Extensive experiments conducted on both overt and synthetic tampering datasets demonstrate that our method effectively locates tampered regions, outperforming existing video forensics techniques.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"58 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple-instance learning to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference between action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets from the localization results. Extensive experiments were performed on the THUMOS-14 and ActivityNet-1.3 datasets. The results show that our method is superior to other state-of-the-art methods, and its performance is comparable to that of fully-supervised approaches.
{"title":"Weakly-supervised temporal action localization using multi-branch attention weighting","authors":"Mengxue Liu, Wenjing Li, Fangzhen Ge, Xiangjun Gao","doi":"10.1007/s00530-024-01445-2","DOIUrl":"https://doi.org/10.1007/s00530-024-01445-2","url":null,"abstract":"<p>Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple instance learning mechanisms to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference of action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets in the localization results. Extensive experiments were performed on the THUMOS-14 and Activitynet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods, and its performance is comparable to those of fully-supervised approaches.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"62 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30. DOI: 10.1007/s00530-024-01457-y
Zunfu Wang, Fang Liu, Changjuan Ran
Style transfer aims to apply the stylistic characteristics of a reference image onto a target image or video. Existing studies on style transfer suffer either from fixed styles without adjustability or from unclear stylistic patterns in the output. Moreover, in video style transfer, issues such as discontinuity in content and time, flickering, and local distortions are common. Current research on artistic image style transfer mainly focuses on Western painting; given the differences between Eastern and Western painting, existing methods cannot be directly applied to the style transfer of Chinese painting. To address these issues, we propose a controllable style transfer method based on generative adversarial networks. The method operates directly in the feature space of the style and content domains, synthesizing target images by merging style features and content features. To enhance the stylization of the output toward Chinese painting, we incorporate stroke constraints and ink diffusion constraints to improve visual quality. To mitigate the flickering and noise caused by blank regions, highlights, and color confusion in Chinese-painting-style videos, we propose a flow-based stylized video optimization strategy to ensure consistency in content and time. Qualitative and quantitative experimental results show that our method outperforms state-of-the-art style transfer methods.
{"title":"Cvstgan: A Controllable Generative Adversarial Network for Video Style Transfer of Chinese Painting","authors":"Zunfu Wang, Fang Liu, Changjuan Ran","doi":"10.1007/s00530-024-01457-y","DOIUrl":"https://doi.org/10.1007/s00530-024-01457-y","url":null,"abstract":"<p>Style transfer aims to apply the stylistic characteristics of a reference image onto a target image or video. Existing studies on style transfer suffer from either fixed style without adjustability or unclear stylistic patterns in output results. Moreover, concerning video style transfer, issues such as discontinuity in content and time, flickering, and local distortions are common. Current research on artistic image style transfer mainly focuses on Western painting. In view of the differences between Eastern and Western painting, the existing methods cannot be directly applied to the style transfer of Chinese painting. To address the aforementioned issues, we propose a controllable style transfer method based on generative adversarial networks. The method operates directly in the feature space of style and content domains, synthesizing target images by merging style features and content features. To enhance the output stylization effect of Chinese painting, we incorporate stroke constraints and ink diffusion constraints to improve the visual quality. To mitigate issues such as blank spaces, highlights, and color confusion resulting in flickering and noise in Chinese painting style videos, we propose a flow-based stylized video optimization strategy to ensure consistency in content and time. Qualitative and quantitative experimental results show that our method outperforms state-of-the-art style transfer methods.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"17 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-29. DOI: 10.1007/s00530-024-01453-2
Chengyun Ma, Qimeng Yang, Shengwei Tian, Long Yu, Shirong Yu
Automatic skin lesion segmentation from dermoscopy images is of great significance for the early treatment of skin cancers, yet it remains challenging even for dermatologists due to inherent issues such as considerable variation in size, shape, and color, and ambiguous boundaries. In this paper, we propose BSP-Net, a network that combines critical boundary information with the segmentation task to simultaneously address the variation and boundary problems in skin lesion segmentation. The architecture of BSP-Net primarily consists of a multi-scale boundary enhancement (MBE) module and a progressive fusion decoder (PD). The MBE module deeply extracts boundary information in both the multi-axis frequency and multi-scale spatial domains to generate precise boundary key-point prediction maps; this not only accurately models local boundary information but also effectively retains global contextual information. The PD, in turn, employs an asymmetric decoding strategy, guiding the generation of refined segmentation results by combining boundary-enhanced features rich in geometric detail with global features containing semantic information about lesions. This strategy progressively fuses boundary and semantic information at different levels, enabling effective collaboration between cross-level contextual features. To assess the effectiveness of BSP-Net, we conducted extensive experiments on two public datasets (ISIC-2016 & PH2, ISIC-2018) and one private dataset (XJUSKin). BSP-Net achieved Dice coefficients of 90.81%, 92.41%, and 83.88%, respectively. It also demonstrated precise boundary delineation, with Average Symmetric Surface Distance (ASSD) scores of 7.96, 6.88, and 10.92%, highlighting its strong performance in skin lesion segmentation.
{"title":"BSP-Net: automatic skin lesion segmentation improved by boundary enhancement and progressive decoding methods","authors":"Chengyun Ma, Qimeng Yang, Shengwei Tian, Long Yu, Shirong Yu","doi":"10.1007/s00530-024-01453-2","DOIUrl":"https://doi.org/10.1007/s00530-024-01453-2","url":null,"abstract":"<p>Automatic skin lesion segmentation from dermoscopy images is of great significance in the early treatment of skin cancers, which is yet challenging even for dermatologists due to the inherent issues, i.e., considerable size, shape and color variation, and ambiguous boundaries. In this paper, we propose a network BSP-Net that implements the combination of critical boundary information and segmentation tasks to simultaneously solve the variation and boundary problems in skin lesion segmentation. The architecture of BSP-Net primarily consists of a multi-scale boundary enhancement (MBE) module and a progressive fusion decoder (PD). The MBE module, by deeply extracting boundary information in both multi-axis frequency and multi-scale spatial domains, generates precise boundary key-point prediction maps. This process not only accurately models local boundary information but also effectively retains global contextual information. On the other hand, the PD employs an asymmetric decoding strategy, guiding the generation of refined segmentation results by combining boundary-enhanced features rich in geometric details with global features containing semantic information about lesions. This strategy progressively fuses boundary and semantic information at different levels, effectively enabling high-performance collaboration between cross-level contextual features. To assess the effectiveness of BSP-Net, we conducted extensive experiments on two public datasets (ISIC-2016 &PH2, ISIC-2018) and one private dataset (XJUSKin). BSP-Net achieved Dice coefficients of 90.81, 92.41, and 83.88%, respectively. Additionally, it demonstrated precise boundary delineation with Average Symmetric Surface Distance (ASSD) scores of 7.96, 6.88, and 10.92%, highlighting its strong performance in skin lesion segmentation.</p>","PeriodicalId":51138,"journal":{"name":"Multimedia Systems","volume":"28 1","pages":""},"PeriodicalIF":3.9,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}