Pub Date : 2024-08-06 DOI: 10.1007/s00371-024-03596-9
Mingjian Li, Younhyun Jung, Shaoli Song, Jinman Kim
Direct volume rendering (DVR) is a commonly used technique for three-dimensional visualization of volumetric medical images. A key goal of DVR is to enable users to visually emphasize regions of interest (ROIs) that may be occluded by other structures. Conventional methods for visual emphasis of ROIs require extensive user involvement to adjust rendering parameters and reduce occlusion, and the adjustment depends on the user's viewing direction. Several works have been proposed to automatically preserve the view of the ROIs by eliminating occluding structures of lower importance in a view-dependent manner. However, they require pre-segmentation labeling and manual importance assignment on the images. An alternative to ROI segmentation is to use 'saliency' to identify important regions. Saliency, however, lacks semantic information and thus leads to the inclusion of false positive regions. In this study, we propose an attention-driven visual emphasis method for volumetric medical image visualization. We developed a deep learning attention model, termed focused-class attention map (F-CAM), trained with only image-wise labels for automated ROI localization and importance estimation. Our F-CAM transfers the semantic information from the classification task to the localization of ROIs, with a focus on the small ROIs that characterize medical images. Additionally, we propose an attention compositing module that integrates the generated attention map with the transfer function within the DVR pipeline to automate the view-dependent visual emphasis of the ROIs. We demonstrate the superiority of our method compared to existing methods on a multi-modality PET-CT dataset and an MRI dataset.
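To make the idea of combining an attention map with a transfer function more concrete, the following is a minimal sketch of front-to-back alpha compositing along a single viewing ray. The abstract does not specify how the attention compositing module works, so the modulation rule used here (scaling each sample's opacity by its attention weight) is an assumption chosen for illustration, and the names `composite_ray`, `attention`, and `transfer_function` are hypothetical.

```python
def composite_ray(samples, attention, transfer_function):
    """Front-to-back alpha compositing along one viewing ray.

    samples: scalar intensities along the ray, ordered front to back.
    attention: one weight in [0, 1] per sample (hypothetical attention map values).
    transfer_function: maps an intensity to (color, opacity), both in [0, 1].
    """
    color_acc = 0.0
    alpha_acc = 0.0
    for value, weight in zip(samples, attention):
        color, opacity = transfer_function(value)
        # Assumed modulation rule: low-attention samples become more transparent,
        # so they occlude high-attention samples behind them less.
        opacity *= weight
        color_acc += (1.0 - alpha_acc) * opacity * color
        alpha_acc += (1.0 - alpha_acc) * opacity
        if alpha_acc >= 0.99:  # early ray termination once nearly opaque
            break
    return color_acc, alpha_acc


# Toy ray: a dim occluder in front of a bright region of interest.
toy_tf = lambda v: (v, 0.8)
emphasized = composite_ray([0.2, 1.0], [0.0, 1.0], toy_tf)
plain = composite_ray([0.2, 1.0], [1.0, 1.0], toy_tf)
print(emphasized, plain)
```

With zero attention on the occluder, the accumulated color comes entirely from the sample behind it, whereas with uniform attention the occluder absorbs most of the ray's opacity budget first.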
{"title":"Attention-driven visual emphasis for medical volumetric image visualization","authors":"Mingjian Li, Younhyun Jung, Shaoli Song, Jinman Kim","doi":"10.1007/s00371-024-03596-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03596-9","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-06","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141942701"}
Pub Date : 2024-08-05 DOI: 10.1007/s00371-024-03571-4
Pan Wu, Jin Tang
Many current deep learning-based multi-focus image fusion (MFIF) methods either over-extract local image features or neglect global image features, which leads to fused images with color distortion, small-area artifacts, large-area blurring, and harsh boundary transitions. To solve these problems, we propose a new global and local feature hierarchical fusion network for multi-focus image fusion, called FHFN. The proposed FHFN is a deep neural network that simultaneously extracts global features using Swin Transformer and local features using ConvNeXt. First, we use the PSA module to enhance the focus on local image features and to let shallow features interact effectively with high-level semantic features. Second, we design a hierarchical feature fusion module (HFFB) that fuses the extracted local and global features hierarchically, which constitutes a new paradigm for solving multi-focus image fusion tasks. Third, we introduce the gradient residual dense module (RGDB) to strengthen the edge features of images and improve the network's ability to extract fine-grained spatial features. On four public datasets, our method is competitive with ten other MFIF methods in terms of both objective quantitative metrics and subjective visual perception, and outperforms other MFIF methods in the same field.
{"title":"FHFN: content and context feature hierarchical fusion networks for multi-focus image fusion","authors":"Pan Wu, Jin Tang","doi":"10.1007/s00371-024-03571-4","DOIUrl":"https://doi.org/10.1007/s00371-024-03571-4","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141942700"}
Pub Date : 2024-08-05 DOI: 10.1007/s00371-024-03585-y
Hong Zhao, Wengai Li, Dailin Huang, Jinhai Huang, Lijun Zhang
Generating high-quality and realistic images from textual descriptions is a formidable challenge with three critical aspects: (1) data imbalance makes feature learning difficult when samples from rare categories are underrepresented in existing datasets; (2) the multimodal feature fusion methods widely used in the past struggle to emphasize key joint features, resulting in weak interactions between modalities; and (3) the entanglement between the generator and discriminator in GANs poses challenges, particularly for the discriminator to effectively fulfill its designated role. To address these issues, this paper proposes a multiattribute learning and multimodal feature fusion-based generative adversarial network (M-GAN). The contributions of this paper are as follows: (1) a multiattribute learning approach is introduced to mitigate data imbalance by enhancing heterogeneous vocabulary and category-relevant labels, which facilitates the propagation of attribute information into images and yields images that better meet task requirements; (2) a multimodal feature fusion approach based on gated attention and enhanced attention emphasizes vital information while suppressing non-essential details, enhancing intermodal interaction and improving fusion accuracy through stronger attention to intramodality correlations; and (3) an optimized generative adversarial network structure employs a U-Net discriminator to capture both structural and semantic changes between real and fake images, improving model performance and generating more realistic images by capturing global structure as well as local details. Extensive experiments conducted on the CUB-200 and MS-COCO datasets demonstrate the effectiveness of our M-GAN approach in text-to-image synthesis. The code will be released at https://github.com/CodeSet1/M-GAN.
{"title":"M-GAN: multiattribute learning and multimodal feature fusion-based generative adversarial network for text-to-image synthesis","authors":"Hong Zhao, Wengai Li, Dailin Huang, Jinhai Huang, Lijun Zhang","doi":"10.1007/s00371-024-03585-y","DOIUrl":"https://doi.org/10.1007/s00371-024-03585-y","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141942702"}
Pub Date : 2024-08-05 DOI: 10.1007/s00371-024-03593-y
Zezheng Tang, Yihua Wu, Xinming Xu
In image recognition, the overlap of strawberries seriously reduces the recognition efficiency of ripe strawberries. This paper proposes an improved YOLOv7-Tiny model for recognizing ripe strawberries. A lightweight RepGhost model is added to the YOLOv7-Tiny model to reduce the computation and the number of model parameters. The SiLU function replaces the LeakyReLU activation function of the backbone CBL convolutional block to improve the nonlinear fitting and feature learning capabilities of the model. The C3 module is fused in the small-object layer to improve the ability to extract information from small objects. The performance of the improved YOLOv7-Tiny model is validated through experiments. The results show that the number of model parameters is reduced by 26.9%, the amount of computation is reduced by 55.4%, the recognition speed is improved by 26.3%, and the mean average precision (mAP) is 89.8%. Compared with the SSD, Faster RCNN, YOLOv3, YOLOv4, and YOLOv5s models, the mAP of the improved YOLOv7-Tiny model increased by 14.2%, 1.52%, 3.15%, 3.01%, and 2.6%, respectively; the recognition speed increased by 79.3%, 92.9%, 80.4%, 58.8%, and 69.6%; and the number of parameters decreased by 90%, 89.7%, 95%, 47.8%, and 14.6%. The recognition accuracy of overlapping and small strawberries is significantly improved in the improved YOLOv7-Tiny model. The model provides technical support for efficient automatic strawberry picking.
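The activation swap mentioned in the abstract can be illustrated with the standard definitions of the two functions. This is a small sketch rather than code from the paper: the negative slope of 0.1 for LeakyReLU is an assumed value, and the function names are chosen here for illustration.

```python
import math


def leaky_relu(x, negative_slope=0.1):
    """LeakyReLU: identity for x >= 0, a small linear slope for x < 0.

    The slope 0.1 is an assumed value, not taken from the abstract.
    """
    return x if x >= 0 else negative_slope * x


def silu(x):
    """SiLU (also called swish): x * sigmoid(x), written in a numerically stable form."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return x * e / (1.0 + e)


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.4f}  silu={silu(x):+.4f}")
```

Unlike LeakyReLU, which is piecewise linear with a kink at zero, SiLU is smooth everywhere and non-monotonic for negative inputs, which is the usual motivation given for such a replacement.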
{"title":"The study of recognizing ripe strawberries based on the improved YOLOv7-Tiny model","authors":"Zezheng Tang, Yihua Wu, Xinming Xu","doi":"10.1007/s00371-024-03593-y","DOIUrl":"https://doi.org/10.1007/s00371-024-03593-y","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141942705"}
Pub Date : 2024-08-05 DOI: 10.1007/s00371-024-03588-9
R. Varun Prakash, V. Karthikeyan, S. Vishali, M. Karthika
Human–animal conflict (HAC) is one of the main issues that the government of India is now addressing. In this work, we propose a stacked long short-term memory (LSTM) framework with hybrid features for automatic wild animal detection and state-of-mind classification based on intelligent perception of the environment. The elephant is the wild animal considered in this work. The study first collects information about wild animals from their environment. We then extract the mel frequency cepstral coefficient (MFCC), delta MFCC, double delta MFCC, and Linear Predictive Coding (LPC) features and combine them in various ways. Combining MFCC and its derivatives with LPC provides improved performance. After that, the elephants are identified, and their state of mind (SOM) is classified using the proposed stacked LSTM framework. The results demonstrate that the stacked LSTM framework performs better than both the single LSTM and the bidirectional LSTM learning networks. The classification accuracy was 98% for elephant detection and 97% for state-of-mind detection. Further, if the presence of elephants is confirmed, they are repelled with the help of an animated predator that scares them away.
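The hybrid feature described above, MFCC plus its first and second derivatives plus LPC, can be sketched as a per-frame concatenation. The sketch assumes the MFCC and LPC vectors have already been computed per frame by some audio front end; it uses the common regression formula for delta features with edge padding, which may differ from the authors' exact settings. The names `delta` and `hybrid_features` and the window `width=2` are illustrative assumptions.

```python
def delta(frames, width=2):
    """Regression-based delta features over a list of per-frame feature vectors.

    d_t = sum_{n=1..width} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..width} n^2),
    with frames beyond the ends replaced by the first or last frame.
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for k in range(len(frames[t])):
            num = 0.0
            for n in range(1, width + 1):
                nxt = frames[min(t + n, n_frames - 1)][k]  # edge padding at the end
                prv = frames[max(t - n, 0)][k]             # edge padding at the start
                num += n * (nxt - prv)
            vec.append(num / denom)
        out.append(vec)
    return out


def hybrid_features(mfcc, lpc):
    """Concatenate MFCC, delta MFCC, double delta MFCC, and LPC for each frame."""
    d1 = delta(mfcc)
    d2 = delta(d1)
    return [m + a + b + l for m, a, b, l in zip(mfcc, d1, d2, lpc)]
```

Each resulting frame vector would then be one time step of the input sequence for a stacked LSTM classifier.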
{"title":"Multi-level LSTM framework with hybrid sonic features for human–animal conflict evasion","authors":"R. Varun Prakash, V. Karthikeyan, S. Vishali, M. Karthika","doi":"10.1007/s00371-024-03588-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03588-9","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141969220"}
Pub Date : 2024-08-05 DOI: 10.1007/s00371-024-03578-x
Changhong Shi, Weirong Liu, Jiahao Meng, Xiongfei Jia, Jie Liu
Great progress has been made in image inpainting tasks with the emergence of convolutional neural networks, because of their superior translation invariance and powerful texture modeling capacity. However, current solutions generally do not perform well in reconstructing high-quality results. To address this issue, a self-prior guided generative adversarial network (SG-GAN) model is proposed. SG-GAN integrates the learning paradigms of cross-attention and convolution into the generator, which allows it to learn the cross-mapping between the input and target datasets effectively. Then, a high receptive field subnet is constructed to increase the receptive field. Finally, a high receptive field feature-matching loss is proposed to further ensure the structural sharpness of generated images. Experiments on datasets including natural scene images (Places2), facial images (CelebA-HQ), structured wall images (Façade), and Dunhuang Mural images show that the proposed method can generate higher quality results with more details than state-of-the-art methods.
{"title":"Self-prior guided generative adversarial network for image inpainting","authors":"Changhong Shi, Weirong Liu, Jiahao Meng, Xiongfei Jia, Jie Liu","doi":"10.1007/s00371-024-03578-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03578-x","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-05","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141942703"}
Pub Date : 2024-08-03 DOI: 10.1007/s00371-024-03580-3
Sicheng Li, Yuhui Chu, Yunpeng Zhao, Pengpeng Zhao
Geometric distortions in digital images, caused by factors such as lens defects and changes in camera angles, substantially reduce image fidelity by altering pixel positions and shapes. Current geometric distortion correction methods focus on specific types of distortion and rely on high computational resources, which limits their universality and practicality across diverse real-world applications. We propose a two-stage distortion correction method that integrates deep learning with traditional image registration algorithms for correcting multiple types of geometric distortion. Compared to state-of-the-art correction methods, our proposed method is flexible, addresses a wide range of geometric distortions, and achieves superior correction results with fewer parameters. In addition, tests performed on synthetic datasets show an improvement of 10.39% for PSNR, 30.42% for SSIM, and 85% for processing speed, compared to the best performing methods to our knowledge. Finally, experiments with handheld medical endoscopic scanners confirm the applicability and robustness of our method in real-world scenarios. Our method offers a versatile and efficient solution for geometric distortion correction, suitable for various applications, including medical imaging and resource-limited embedded systems. Code is available at https://github.com/MaybeRichard/EffiGeoNet
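For readers unfamiliar with the PSNR metric cited above, here is a minimal sketch of its standard definition, PSNR = 10 log10(MAX^2 / MSE), for grayscale images stored as nested lists. It is not taken from the linked repository; the function name `psnr` and the default `max_value=255.0` (8-bit images) are assumptions made for illustration.

```python
import math


def psnr(reference, corrected, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images.

    reference, corrected: lists of rows of pixel intensities.
    max_value: largest possible pixel value (255 assumed for 8-bit images).
    """
    flat_ref = [p for row in reference for p in row]
    flat_cor = [p for row in corrected for p in row]
    if len(flat_ref) != len(flat_cor) or not flat_ref:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_cor)) / len(flat_ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


ground_truth = [[10, 20], [30, 40]]
corrected = [[12, 20], [30, 38]]
print(psnr(ground_truth, corrected))
```

A higher PSNR means the corrected image is closer to the undistorted reference, so an improvement in PSNR corresponds to a lower mean squared pixel error.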
{"title":"An efficient deep learning-based framework for image distortion correction","authors":"Sicheng Li, Yuhui Chu, Yunpeng Zhao, Pengpeng Zhao","doi":"10.1007/s00371-024-03580-3","DOIUrl":"https://doi.org/10.1007/s00371-024-03580-3","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-03","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141882128"}
Pub Date : 2024-08-03 DOI: 10.1007/s00371-024-03591-0
Fengling Li, Zheng Yang, Yan Gui
Small object graphics detection plays a crucial role in various domains, including surveillance, urban management, and autonomous driving. However, existing object detection methods perform poorly when detecting multiple small objects. To tackle this issue, we propose the SES-yolov5 algorithm for small object detection, which incorporates a multi-scale fusion attention mechanism and feature enhancement techniques. Firstly, we enhance the neck network structure by integrating shallow feature fusion (SFF) and a small object detection head (STD), enabling the extraction of more detailed shallow feature information from high-resolution images. Secondly, we integrate an efficient channel and spatial attention (ECSA) mechanism into the backbone network to further filter redundant semantic information while highlighting the small objects for detection. Finally, we introduce a spatial feature refinement module (SFRM) to connect the main network with the neck network, enriching the features fed to the neck while expanding the receptive field and minimizing the loss of small object information. Experimental results on the VisDrone2021 dataset demonstrate that, compared to the original YOLOv5 algorithm, SES-yolov5 achieves an 8.3% increase in mAP50, a 7.5% improvement in detection accuracy, and a 6.4% increase in recall on average. The effectiveness of our method is also validated on the TT100K dataset. Code is available at https://github.com/Yangzheng00/SES-yolov5.git.
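The mAP50 metric reported above counts a predicted box as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5, and then averages the per-class average precision. The sketch below shows only the IoU computation and the 0.5 threshold check; it is a generic illustration rather than code from the linked repository, and the corner-format boxes and the function names are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2), x1 < x2 and y1 < y2."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """True if the prediction would count as a true positive at the mAP50 threshold."""
    return iou(pred_box, gt_box) >= 0.5


# A two-pixel offset barely matters for a large box but breaks the match for a small one.
print(is_match_at_50((0, 0, 100, 100), (2, 2, 102, 102)))
print(is_match_at_50((0, 0, 4, 4), (2, 2, 6, 6)))
```

The usage lines show why small objects are hard under this metric: the same absolute localization error costs a small box far more IoU than a large one.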
{"title":"SES-yolov5: small object graphics detection and visualization applications","authors":"Fengling Li, Zheng Yang, Yan Gui","doi":"10.1007/s00371-024-03591-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03591-0","PeriodicalId":501186,"journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-03","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"141882044"}
Pub Date : 2024-08-01 DOI: 10.1007/s00371-024-03582-1
Hao Li, Guoheng Huang, Xiaochen Yuan, Zewen Zheng, Xuhang Chen, Guo Zhong, Chi-Man Pun
Few-shot semantic segmentation aims to learn a generalized model for unseen-class segmentation with just a few densely annotated samples. Most current metric-based prototype learning models build prototypes directly from support samples through Masked Average Pooling and use them to assist query sample segmentation. However, these methods frequently fail to consider the semantic ambiguity of prototypes, the limitations in performance when dealing with extreme variations in objects, and the semantic similarities between different classes. In this paper, we introduce a novel network architecture named Prototype-guided Salient Attention Network (PSANet). Specifically, we employ prototype-guided attention to learn salient regions, allocating different attention weights to features at different spatial locations of the target to enhance the significance of salient regions within the prototype. To mitigate the impact of external distractor categories on the prototype, our proposed contrastive loss learns a more discriminative prototype that promotes inter-class feature separation and intra-class feature compactness. Moreover, we add a refinement operation to the multi-scale module to better capture complete contextual information about features at various scales. Despite the simplicity of our strategy, extensive tests on the PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets demonstrate its effectiveness. Our code is available at https://github.com/woaixuexixuexi/PSANet.
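Masked Average Pooling, the baseline prototype step mentioned above, averages the support feature vectors over the pixels inside the support mask; query pixels are then commonly scored against that prototype, for example by cosine similarity. The sketch below illustrates this generic baseline on nested lists and is not PSANet itself; the function names and the choice of cosine similarity are assumptions made for illustration.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors of all pixels where the mask is 1.

    features: H x W x C nested lists of floats (support feature map).
    mask: H x W nested lists of 0/1 values (support foreground mask).
    """
    channels = len(features[0][0])
    sums = [0.0] * channels
    count = 0.0
    for feat_row, mask_row in zip(features, mask):
        for feat, m in zip(feat_row, mask_row):
            count += m
            for c in range(channels):
                sums[c] += m * feat[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [s / count for s in sums]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, or 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


support = [[[1.0, 0.0], [3.0, 0.0]],
           [[0.0, 5.0], [0.0, 7.0]]]
foreground = [[1, 1],
              [0, 0]]
prototype = masked_average_pooling(support, foreground)
print(prototype, cosine_similarity(prototype, [0.5, 0.0]))
```

Because every masked pixel contributes equally to the average, a single prototype can blur distinctive parts of the object, which is the kind of ambiguity that attention weighting over spatial locations is meant to reduce.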
{"title":"Psanet: prototype-guided salient attention for few-shot segmentation","authors":"Hao Li, Guoheng Huang, Xiaochen Yuan, Zewen Zheng, Xuhang Chen, Guo Zhong, Chi-Man Pun","doi":"10.1007/s00371-024-03582-1","DOIUrl":"https://doi.org/10.1007/s00371-024-03582-1","journal":{"name":"The Visual Computer"},"publicationDate":"2024-08-01","publicationTypes":"Journal Article","isOpenAccess":false}
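The PSANet abstract refers to Masked Average Pooling as the step most metric-based prototype models use to turn support samples into a class prototype. The following is a minimal sketch of that generic baseline step only, not of PSANet's prototype-guided attention or its contrastive loss. The function names, the nested-list feature layout (channels x height x width) and the toy values are assumptions made for illustration.

```python
import math


def masked_average_pooling(features, mask, eps=1e-8):
    """Average each feature channel over the foreground pixels of a binary mask.

    features: nested lists shaped [C][H][W] (assumed layout).
    mask:     nested lists shaped [H][W] with 1 for foreground, 0 for background.
    Returns a length-C prototype vector.
    """
    mask_sum = sum(sum(row) for row in mask)
    prototype = []
    for channel in features:
        weighted = sum(
            channel[i][j] * mask[i][j]
            for i in range(len(mask))
            for j in range(len(mask[0]))
        )
        # eps guards against an empty mask
        prototype.append(weighted / (mask_sum + eps))
    return prototype


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between a query pixel feature and the prototype."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


# Toy support sample: 2 channels on a 2x2 grid; the left column is foreground.
support_features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
support_mask = [[1, 0], [1, 0]]

prototype = masked_average_pooling(support_features, support_mask)

# A query pixel whose feature points in the same direction as the prototype
# scores close to 1 and would be labeled as foreground.
query_pixel = [2.0, 0.0]
similarity = cosine_similarity(query_pixel, prototype)
```

Because every foreground pixel contributes equally to the average, the prototype carries no notion of which regions are most discriminative. That is the limitation the abstract says prototype-guided attention is meant to address.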
Pub Date: 2024-07-30 DOI: 10.1007/s00371-024-03572-3
Ling-Xiao Qin, Hong-Mei Sun, Xiao-Meng Duan, Cheng-Yue Che, Rui-Sheng Jia
To maintain competitive density estimation performance, most existing works design cumbersome network structures to extract and refine vehicle features, resulting in heavy computational resource consumption and storage burden during inference, which severely limits their deployment scope and makes them difficult to apply in practical scenarios. To solve these problems, we propose a lightweight network for real-time vehicle density estimation (LSENet). Specifically, the network consists of three parts: a pre-trained heavy teacher network, an adaptive integration block and a lightweight student network. First, a teacher network based on a deep single-column transformer is designed to provide effective global dependency and vehicle distribution knowledge for the student network to learn. Second, to address the intermediate layer mismatch and dimensionality inconsistency between the teacher network and the student network, an adaptive integration block is designed to efficiently guide the student network’s learning by dynamically selecting the self-attention heads that have the most influence on the network decision as the source of distilled knowledge. Finally, to complement the fine-grained features, CNN blocks are placed in parallel with the student network’s transformer backbone to improve the network’s ability to capture vehicle details. Extensive experiments on two vehicle benchmark datasets, TRANCOS and VisDrone2019, show that LSENet achieves an optimal trade-off between density estimation accuracy and operational speed compared to other state-of-the-art methods and is therefore suitable for deployment on computationally resource-poor edge devices. Our code will be available at https://github.com/goudaner1/LSENet.
{"title":"Adaptive learning-enhanced lightweight network for real-time vehicle density estimation","authors":"Ling-Xiao Qin, Hong-Mei Sun, Xiao-Meng Duan, Cheng-Yue Che, Rui-Sheng Jia","doi":"10.1007/s00371-024-03572-3","DOIUrl":"https://doi.org/10.1007/s00371-024-03572-3","journal":{"name":"The Visual Computer"},"publicationDate":"2024-07-30","publicationTypes":"Journal Article","isOpenAccess":false}
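The LSENet abstract describes a teacher-student setup in which the most influential self-attention heads serve as the source of distilled knowledge. The sketch below illustrates only the generic idea under stated assumptions: a plain output-level distillation loss that mixes a ground-truth term with a teacher term, and a top-k selection over per-head importance scores. The abstract does not say how LSENet's adaptive integration block scores or assigns heads, so `top_k_heads`, the `alpha` weighting and the toy numbers are all hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student, teacher, target, alpha=0.5):
    """Blend the supervised loss with a term that pulls the student toward the teacher.

    alpha is an assumed mixing weight; alpha = 0 ignores the teacher entirely.
    """
    return (1 - alpha) * mse(student, target) + alpha * mse(student, teacher)


def top_k_heads(importance_scores, k):
    """Return the indices of the k heads with the highest (assumed) importance scores."""
    ranked = sorted(
        range(len(importance_scores)),
        key=lambda head: importance_scores[head],
        reverse=True,
    )
    return ranked[:k]


# Toy flattened 2x2 density maps; the predicted vehicle count is the sum of the map.
target_density = [0.0, 1.0, 1.0, 0.0]
teacher_density = [0.0, 1.0, 0.5, 0.5]
student_density = [0.0, 1.0, 0.5, 0.5]

loss = distillation_loss(student_density, teacher_density, target_density)
predicted_count = sum(student_density)

# Hypothetical per-head importance scores for a 4-head attention layer.
selected_heads = top_k_heads([0.1, 0.7, 0.3, 0.9], k=2)
```

In this toy case the student already matches the teacher, so the whole loss comes from the ground-truth term. Selecting a subset of heads rather than matching all of them is one plausible way to sidestep the layer and dimensionality mismatch the abstract mentions.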