Filter-deform attention GAN: constructing human motion videos from few images
Pub Date: 2024-08-26  DOI: 10.1007/s00371-024-03595-w
Jianjun Zhu, Huihuang Zhao, Yudong Zhang
Human motion transfer is challenging due to the complexity and diversity of human motion and clothing textures. Existing methods use 2D pose estimation to obtain poses, which can easily lead to unsmooth motion and artifacts. Therefore, this paper proposes a highly robust motion transfer model based on image deformation, called the Filter-Deform Attention Generative Adversarial Network (FDA GAN). This method can transfer complex human motion videos using only a few human images. First, we use a 3D pose and shape estimator instead of traditional 2D pose estimation to address the problem of unsmooth motion. Then, to tackle the artifact problem, we design a new attention mechanism and integrate it with the GAN, proposing a new network capable of effectively extracting image features and generating human motion videos. Finally, to further transfer the style of the source person, we propose a two-stream style loss, which enhances the model's learning ability. Experimental results demonstrate that the proposed method outperforms recent methods in overall performance and across various evaluation metrics. Project page: https://github.com/mioyeah/FDA-GAN.
{"title":"Filter-deform attention GAN: constructing human motion videos from few images","authors":"Jianjun Zhu, Huihuang Zhao, Yudong Zhang","doi":"10.1007/s00371-024-03595-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03595-w","url":null,"abstract":"<p>Human motion transfer is challenging due to the complexity and diversity of human motion and clothing textures. Existing methods use 2D pose estimation to obtain poses, which can easily lead to unsmooth motion and artifacts. Therefore, this paper proposes a highly robust motion transmission model based on image deformation, called the Filter-Deform Attention Generative Adversarial Network (FDA GAN). This method can transmit complex human motion videos using only few human images. First, we use a 3D pose shape estimator instead of traditional 2D pose estimation to address the problem of unsmooth motion. Then, to tackle the artifact problem, we design a new attention mechanism and integrate it with the GAN, proposing a new network capable of effectively extracting image features and generating human motion videos. Finally, to further transfer the style of the source human, we propose a two-stream style loss, which enhances the model’s learning ability. Experimental results demonstrate that the proposed method outperforms recent methods in overall performance and various evaluation metrics. Project page: https://github.com/mioyeah/FDA-GAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLDC: combining global and local consistency of multibranch depth completion
Pub Date: 2024-08-26  DOI: 10.1007/s00371-024-03609-7
Yaping Deng, Yingjiang Li, Zibo Wei, Keying Li
Depth completion aims to generate dense depth maps from sparse depth maps and corresponding RGB images. In this task, the locality of convolutional layers makes it difficult for the network to obtain global information. While Transformer-based architectures perform well in capturing global information, they may lose local detail features. Consequently, attending to global and local information simultaneously is crucial for effective depth completion. This paper proposes a novel and effective dual-encoder, three-decoder network consisting of local and global branches. Specifically, the local branch uses a convolutional network and the global branch uses a Transformer network to extract rich features. Meanwhile, the local branch is dominated by the color image and the global branch by the depth map, so that multimodal information is thoroughly integrated and utilized. In addition, a gate fusion mechanism is used in the decoder stage to fuse local and global information, achieving high-performance depth completion. This hybrid architecture is conducive to the effective fusion of local detail information and contextual information. Experimental results demonstrate the superiority of our method over other advanced methods on the KITTI Depth Completion and NYU v2 datasets.
{"title":"GLDC: combining global and local consistency of multibranch depth completion","authors":"Yaping Deng, Yingjiang Li, Zibo Wei, Keying Li","doi":"10.1007/s00371-024-03609-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03609-7","url":null,"abstract":"<p>Depth completion aims to generate dense depth maps from sparse depth maps and corresponding RGB images. In this task, the locality based on the convolutional layer poses challenges for the network in obtaining global information. While the Transformer-based architecture performs well in capturing global information, it may lead to the loss of local detail features. Consequently, improving the simultaneous attention to global and local information is crucial for achieving effective depth completion. This paper proposes a novel and effective dual-encoder–three-decoder network, consisting of local and global branches. Specifically, the local branch uses a convolutional network, and the global branch utilizes a Transformer network to extract rich features. Meanwhile, the local branch is dominated by color image and the global branch is dominated by depth map to thoroughly integrate and utilize multimodal information. In addition, a gate fusion mechanism is used in the decoder stage to fuse local and global information, to achieving high-performance depth completion. This hybrid architecture is conducive to the effective fusion of local detail information and contextual information. Experimental results demonstrated the superiority of our method over other advanced methods on KITTI Depth Completion and NYU v2 datasets.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ORHLR-Net: one-stage residual learning network for joint single-image specular highlight detection and removal
Pub Date: 2024-08-24  DOI: 10.1007/s00371-024-03607-9
Wenzhe Shi, Ziqi Hu, Hao Chen, Hengjia Zhang, Jiale Yang, Li Li
Detecting and removing specular highlights is a complex task whose solution can greatly enhance various downstream vision tasks in real-world environments. Although previous works have made great progress, they often ignore specular highlight areas or produce unsatisfactory results with visual artifacts such as color distortion. In this paper, we present a framework that utilizes an encoder–decoder structure for the joint task of specular highlight detection and removal in single images, employing specular highlight mask guidance. The encoder uses EfficientNet as a feature extraction backbone to convert the input RGB image into a series of feature maps. The decoder gradually restores these feature maps to their original size through up-sampling. In the specular highlight detection module, we enhance the network with residual modules to extract additional feature information, thereby improving detection accuracy. For the specular highlight removal module, we introduce the Convolutional Block Attention Module, which dynamically captures the importance of each channel and spatial location in the input feature map. This enables the model to effectively distinguish between foreground and background, resulting in enhanced adaptability and accuracy in complex scenes. We evaluate the proposed method on the publicly available SHIQ dataset, and a comparative analysis of the experimental results demonstrates its superiority. The source code will be available at https://github.com/hzq2333/ORHLR-Net.
{"title":"Orhlr-net: one-stage residual learning network for joint single-image specular highlight detection and removal","authors":"Wenzhe Shi, Ziqi Hu, Hao Chen, Hengjia Zhang, Jiale Yang, Li Li","doi":"10.1007/s00371-024-03607-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03607-9","url":null,"abstract":"<p>Detecting and removing specular highlights is a complex task that can greatly enhance various visual tasks in real-world environments. Although previous works have made great progress, they often ignore specular highlight areas or produce unsatisfactory results with visual artifacts such as color distortion. In this paper, we present a framework that utilizes an encoder–decoder structure for the combined task of specular highlight detection and removal in single images, employing specular highlight mask guidance. The encoder uses EfficientNet as a feature extraction backbone network to convert the input RGB image into a series of feature maps. The decoder gradually restores these feature maps to their original size through up-sampling. In the specular highlight detection module, we enhance the network by utilizing residual modules to extract additional feature information, thereby improving detection accuracy. For the specular highlight removal module, we introduce the Convolutional Block Attention Module, which dynamically captures the importance of each channel and spatial location in the input feature map. This enables the model to effectively distinguish between foreground and background, resulting in enhanced adaptability and accuracy in complex scenes. We evaluate the proposed method on the publicly available SHIQ dataset, and its superiority is demonstrated through a comparative analysis of the experimental results. The source code will be available at https://github.com/hzq2333/ORHLR-Net.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slot-VTON: subject-driven diffusion-based virtual try-on with slot attention
Pub Date: 2024-08-23  DOI: 10.1007/s00371-024-03603-z
Jianglei Ye, Yigang Wang, Fengmao Xie, Qin Wang, Xiaoling Gu, Zizhao Wu
Virtual try-on aims to transfer clothes from one image to another while preserving intricate wearer and clothing details. Tremendous efforts have been made to facilitate the task with deep generative models such as GANs and diffusion models; however, current methods do not take into account the influence of the natural environment (background and unrelated impurities) on the clothing image, leading to the loss of details such as intricate textures, shadows, and folds. In this paper, we introduce Slot-VTON, a slot attention-based inpainting approach for seamless, subject-driven image generation. Specifically, we adopt an attention mechanism, termed slot attention, that can separate the various subjects within an image without supervision. With slot attention, we distill the clothing image into a series of slot representations, where each slot represents a subject. Guided by the extracted clothing slot, our method can eliminate the interference of other unnecessary factors, thereby better preserving the complex details of the clothing. To further enhance the seamless generation of the diffusion model, we design a fusion adapter that integrates multiple conditions, including the slot and other added clothing conditions. In addition, a non-garment inpainting module is used to further fix visible seams and preserve non-clothing details (hands, neck, etc.). Multiple experiments on the VITON-HD dataset validate the efficacy of our method, showcasing state-of-the-art generation performance. Our implementation is available at: https://github.com/SilverLakee/Slot-VTON.
{"title":"Slot-VTON: subject-driven diffusion-based virtual try-on with slot attention","authors":"Jianglei Ye, Yigang Wang, Fengmao Xie, Qin Wang, Xiaoling Gu, Zizhao Wu","doi":"10.1007/s00371-024-03603-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03603-z","url":null,"abstract":"<p>Virtual try-on aims to transfer clothes from one image to another while preserving intricate wearer and clothing details. Tremendous efforts have been made to facilitate the task based on deep generative models such as GAN and diffusion models; however, the current methods have not taken into account the influence of the natural environment (background and unrelated impurities) on clothing image, leading to issues such as loss of detail, intricate textures, shadows, and folds. In this paper, we introduce Slot-VTON, a slot attention-based inpainting approach for seamless image generation in a subject-driven way. Specifically, we adopt an attention mechanism, termed slot attention, that can unsupervisedly separate the various subjects within images. With slot attention, we distill the clothing image into a series of slot representations, where each slot represents a subject. Guided by the extracted clothing slot, our method is capable of eliminating the interference of other unnecessary factors, thereby better preserving the complex details of the clothing. To further enhance the seamless generation of the diffusion model, we design a fusion adapter that integrates multiple conditions, including the slot and other added clothing conditions. In addition, a non-garment inpainting module is used to further fix visible seams and preserve non-clothing area details (hands, neck, etc.). Multiple experiments on VITON-HD datasets validate the efficacy of our methods, showcasing state-of-the-art generation performances. Our implementation is available at: https://github.com/SilverLakee/Slot-VTON.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning
Pub Date: 2024-08-23  DOI: 10.1007/s00371-024-03600-2
Gang Chen, Wenju Wang, Haoran Zhou, Xiaolin Wang
Representation learning on point cloud data that simultaneously captures local and global feature information is an urgent problem in high-precision 3D environment perception. However, current representation learning methods either focus only on efficiently learning local features, or capture long-distance dependencies at the cost of fine-grained features. Therefore, we explore transformers on the topological structure of point cloud graphs and propose an enhanced graph convolutional transformer (EGCT) method. EGCT constructs a graph topology for the disordered and unstructured point cloud. It then uses an enhanced point feature representation to further aggregate the feature information of all neighborhood points, compactly representing the features of the local neighborhood graph. Subsequently, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph, efficiently exploiting the spatial geometric information of point cloud objects. EGCT therefore learns comprehensive geometric information of point cloud objects, which helps to improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves an mIoU of 86.8% and OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset S3DIS, the OA of the EGCT method is 90.1% and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method achieves point cloud segmentation and classification performance comparable to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT.
{"title":"EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning","authors":"Gang Chen, Wenju Wang, Haoran Zhou, Xiaolin Wang","doi":"10.1007/s00371-024-03600-2","DOIUrl":"https://doi.org/10.1007/s00371-024-03600-2","url":null,"abstract":"<p>It is an urgent problem of high-precision 3D environment perception to carry out representation learning on point cloud data, which complete the synchronous acquisition of local and global feature information. However, current representation learning methods either only focus on how to efficiently learn local features, or capture long-distance dependencies but lose the fine-grained features. Therefore, we explore transformer on topological structures of point cloud graphs, proposing an enhanced graph convolutional transformer (EGCT) method. EGCT construct graph topology for disordered and unstructured point cloud. Then it uses the enhanced point feature representation method to further aggregate the feature information of all neighborhood points, which can compactly represent the features of this local neighborhood graph. Subsequent process, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph. It efficiently utilizes the spatial geometric information of point cloud objects. Therefore, EGCT learns comprehensive geometric information of point cloud objects, which can help to improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves a mIoU of 86.8%, OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset (S3DIS), the OA of EGCT method is 90.1%, and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method can achieve comparable point cloud segmentation and classification performance to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MFPENet: multistage foreground-perception enhancement network for remote-sensing scene classification
Pub Date: 2024-08-13  DOI: 10.1007/s00371-024-03587-w
Junding Sun, Chenxu Wang, Haifeng Sima, Xiaosheng Wu, Shuihua Wang, Yudong Zhang
Scene classification plays a vital role in the field of remote sensing (RS). However, remote-sensing images exhibit complex scene information and large-scale spatial variation, as well as high similarity between different classes and significant differences within the same class, which poses great challenges to scene classification. To address these issues, a multistage foreground-perception enhancement network (MFPENet) is proposed to enhance the ability to perceive foreground features, thereby improving classification accuracy. Firstly, to enrich the scene semantics of the feature information, a multi-scale feature aggregation module is designed using dilated convolution, which takes the features of different stages of the backbone network as input to obtain enhanced multiscale features. Then, a novel foreground-perception enhancement module is designed to capture foreground information. Unlike previous methods, we separate foreground features by designing feature masks and then innovatively explore the symbiotic relationship between foreground features and scene features to further improve the recognition of foreground features. Finally, a hierarchical attention module is designed to reduce the interference of redundant background details on classification. By embedding the dependence between adjacent-level features into the attention mechanism, the model can pay more accurate attention to the key information; redundancy is reduced, and the loss of useful information is minimized. Experiments on three public RS scene classification datasets (UC-Merced, Aerial Image Dataset, and NWPU-RESISC45) show that our method achieves highly competitive results. Future work will focus on utilizing the background features outside the effective foreground features in an image as a decision aid to improve the distinguishability between similar scenes. The source code of our proposed algorithm and the related datasets are available at https://github.com/Hpu-wcx/MFPENet.
{"title":"Mfpenet: multistage foreground-perception enhancement network for remote-sensing scene classification","authors":"Junding Sun, Chenxu Wang, Haifeng Sima, Xiaosheng Wu, Shuihua Wang, Yudong Zhang","doi":"10.1007/s00371-024-03587-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03587-w","url":null,"abstract":"<p>Scene classification plays a vital role in the field of remote-sensing (RS). However, remote-sensing images have the essential properties of complex scene information and large-scale spatial changes, as well as the high similarity between various classes and the significant differences within the same class, which brings great challenges to scene classification. To address these issues, a multistage foreground-perception enhancement network (MFPENet) is proposed to enhance the ability to perceive foreground features, thereby improving classification accuracy. Firstly, to enrich the scene semantics of feature information, a multi-scale feature aggregation module is specifically designed using dilated convolution, which takes the features of different stages of the backbone network as input data to obtain enhanced multiscale features. Then, a novel foreground-perception enhancement module is designed to capture foreground information. Unlike the previous methods, we separate foreground features by designing feature masks and then innovatively explore the symbiotic relationship between foreground features and scene features to improve the recognition ability of foreground features further. Finally, a hierarchical attention module is designed to reduce the interference of redundant background details on classification. By embedding the dependence between adjacent level features into the attention mechanism, the model can pay more accurate attention to the key information. Redundancy is reduced, and the loss of useful information is minimized. Experiments on three public RS scene classification datasets [UC-Merced, Aerial Image Dataset, and NWPU-RESISC45] show that our method achieves highly competitive results. Future work will focus on utilizing the background features outside the effective foreground features in the image as a decision aid to improve the distinguishability between similar scenes. The source code of our proposed algorithm and the related datasets are available at https://github.com/Hpu-wcx/MFPENet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Directional latent space representation for medical image segmentation
Pub Date: 2024-08-12  DOI: 10.1007/s00371-024-03589-8
Xintao Liu, Yan Gao, Changqing Zhan, Qiao Wangr, Yu Zhang, Yi He, Hongyan Quan
Accurate medical image segmentation plays an important role in computer-aided diagnosis, and deep mining of pixel semantics is crucial for it. However, previous works on medical semantic segmentation usually overlook the importance of the embedding subspace and do not mine directional information in the latent space. In this work, we construct global orthogonal bases and channel orthogonal bases in the latent space, which significantly enhance the feature representation. We propose a novel distance-based segmentation method that decouples the embedding space into sub-embedding spaces of different classes and then performs pixel-level classification based on the distance between a pixel's embedding features and the origin of each subspace. Experiments on various public medical image segmentation benchmarks show that our model is superior to state-of-the-art methods. The code will be published at https://github.com/lxt0525/LSDENet.
{"title":"Directional latent space representation for medical image segmentation","authors":"Xintao Liu, Yan Gao, Changqing Zhan, Qiao Wangr, Yu Zhang, Yi He, Hongyan Quan","doi":"10.1007/s00371-024-03589-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03589-8","url":null,"abstract":"<p>Excellent medical image segmentation plays an important role in computer-aided diagnosis. Deep mining of pixel semantics is crucial for medical image segmentation. However, previous works on medical semantic segmentation usually overlook the importance of embedding subspace, and lacked the mining of latent space direction information. In this work, we construct global orthogonal bases and channel orthogonal bases in the latent space, which can significantly enhance the feature representation. We propose a novel distance-based segmentation method that decouples the embedding space into sub-embedding spaces of different classes, and then implements pixel level classification based on the distance between its embedding features and the origin of the subspace. Experiments on various public medical image segmentation benchmarks have shown that our model is superior compared to state-of-the-art methods. The code will be published at https://github.com/lxt0525/LSDENet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization
Pub Date: 2024-08-07  DOI: 10.1007/s00371-024-03290-w
Zhi Chen, Lijun Liu, Zhen Yu
Visual object tracking for unmanned aerial vehicles (UAVs) based on the discriminative correlation filter (DCF) has attracted extensive research and attention due to its computational efficiency and rapid progress, but it always suffers from unwanted boundary effects. To alleviate this problem, spatial-temporal regularized correlation filter frameworks have been proposed, which introduce a constant regularization term to penalize the coefficients of the DCF. Such trackers can substantially improve tracking performance but increase computational complexity. Moreover, these methods cannot adapt to specific appearance variations of the object, and much effort is needed to fine-tune the spatial-temporal regularization weight coefficients. In this work, an adaptive spatial-temporal weighted regularization (ASTWR) model is proposed. An ASTWR module is introduced to obtain the weighted spatial-temporal regularization coefficients automatically. The proposed ASTWR model deals effectively with complex situations and substantially improves the credibility of the tracking results. In addition, an adaptive spatial-temporal constraint adjusting mechanism is proposed: by suppressing drastic appearance changes between adjacent frames, the tracker enables smooth filter learning in the detection phase. Extensive experiments show that the proposed tracker performs favorably against comparable UAV-based and DCF-based trackers. Moreover, the ASTWR tracker runs at over 35 FPS on a single CPU platform and attains AUC scores of 57.9% and 49.7% on the UAV123 and VisDrone2020 datasets, respectively.
{"title":"Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization","authors":"Zhi Chen, Lijun Liu, Zhen Yu","doi":"10.1007/s00371-024-03290-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03290-w","url":null,"abstract":"<p>The unmanned aerial vehicles (UAV) visual object tracking method based on the discriminative correlation filter (DCF) has gained extensive research and attention due to its superior computation and extraordinary progress, but is always suffers from unnecessary boundary effects. To solve the aforementioned problems, a spatial-temporal regularization correlation filter framework is proposed, which is achieved by introducing a constant regularization term to penalize the coefficients of the DCF filter. The tracker can substantially improve the tracking performance but increase computational complexity. However, these kinds of methods make the object fail to adapt to specific appearance variations, and we need to pay much effort in fine-tuning the spatial-temporal regularization weight coefficients. In this work, an adaptive spatial-temporal weighted regularization (ASTWR) model is proposed. An ASTWR module is introduced to obtain the weighted spatial-temporal regularization coefficients automatically. The proposed ASTWR model can deal effectively with complex situations and substantially improve the credibility of tracking results. In addition, an adaptive spatial-temporal constraint adjusting mechanism is proposed. By repressing the drastic appearance changes between adjacent frames, the tracker enables smooth filter learning in the detection phase. Substantial experiments show that the proposed tracker performs favorably against homogeneous UAV-based and DCF-based trackers. Moreover, the ASTWR tracker reaches over 35 FPS on a single CPU platform, and gains an AUC score of 57.9% and 49.7% on the UAV123 and VisDrone2020 datasets, respectively.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data visualization in healthcare and medicine: a survey
Pub Date: 2024-08-07  DOI: 10.1007/s00371-024-03586-x
Xunan Tan, Xiang Suo, Wenjun Li, Lei Bi, Fangshu Yao
Visualization analysis is crucial in healthcare, as it provides insights into complex data and helps healthcare professionals work more efficiently. Information visualization leverages algorithms to reduce the complexity of high-dimensional heterogeneous data, thereby enhancing healthcare professionals' understanding of the hidden associations within data structures. In the field of healthcare visualization, efforts have been made to refine and enhance the utility of data through diverse algorithms and visualization techniques. This review aims to summarize the existing research in this domain and identify future research directions. We searched the Web of Science, Google Scholar, and IEEE Xplore databases, and ultimately 76 articles were included in our analysis. We collected and synthesized the research findings from these articles, with a focus on visualization, artificial intelligence, and supporting tasks in healthcare. Our study revealed that researchers from diverse fields have employed a wide range of visualization techniques to visualize various types of data. We summarize these visualization methods and propose recommendations for future research. We anticipate that our findings will promote the further development and application of visualization techniques in healthcare.
{"title":"Data visualization in healthcare and medicine: a survey","authors":"Xunan Tan, Xiang Suo, Wenjun Li, Lei Bi, Fangshu Yao","doi":"10.1007/s00371-024-03586-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03586-x","url":null,"abstract":"<p>Visualization analysis is crucial in healthcare as it provides insights into complex data and aids healthcare professionals in efficiency. Information visualization leverages algorithms to reduce the complexity of high-dimensional heterogeneous data, thereby enhancing healthcare professionals’ understanding of the hidden associations among data structures. In the field of healthcare visualization, efforts have been made to refine and enhance the utility of data through diverse algorithms and visualization techniques. This review aims to summarize the existing research in this domain and identify future research directions. We searched Web of Science, Google Scholar and IEEE Xplore databases, and ultimately, 76 articles were included in our analysis. We collected and synthesized the research findings from these articles, with a focus on visualization, artificial intelligence and supporting tasks in healthcare. Our study revealed that researchers from diverse fields have employed a wide range of visualization techniques to visualize various types of data. We summarized these visualization methods and proposed recommendations for future research. We anticipate that our findings will promote further development and application of visualization techniques in healthcare.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient minor defects detection on steel surface via res-attention and position encoding
Pub Date: 2024-08-07  DOI: 10.1007/s00371-024-03583-0
Chuang Wu, Tingqin He
Impurities and complex manufacturing processes result in many small, dense defects on steel surfaces, which calls for precise defect detection models. Single-stage models (based on YOLO) are a popular choice, renowned for their computational efficiency and suitability for real-time online applications. However, existing YOLO-based models often fail to detect small features. To address this issue, we introduce an efficient steel surface defect detection model built on YOLOv7, incorporating a feature preservation block (FPB) and a location awareness feature pyramid network (LAFPN). The FPB uses shortcut connections that allow the upper layers to access detailed information directly, thus capturing minor defect features more effectively. Furthermore, the LAFPN integrates coordinate data during the feature fusion phase, enhancing the detection of minor defects. We also introduce a new loss function to identify and locate minor defects accurately. Extensive testing on two public datasets demonstrates the superior performance of our model compared to five baseline models, achieving 80.8 mAP on the NEU-DET dataset and 72.6 mAP on the GC10-DET dataset.
{"title":"Efficient minor defects detection on steel surface via res-attention and position encoding","authors":"Chuang Wu, Tingqin He","doi":"10.1007/s00371-024-03583-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03583-0","url":null,"abstract":"<p>Impurities and complex manufacturing processes result in many minor, dense steel defects. This situation requires precise defect detection models for effective protection. The single-stage model (based on YOLO) is a popular choice among current models, renowned for its computational efficiency and suitability for real-time online applications. However, existing YOLO-based models often fail to detect small features. To address this issue, we introduce an efficient steel surface defect detection model in YOLOv7, incorporating a feature preservation block (FPB) and location awareness feature pyramid network (LAFPN). The FPB uses shortcut connections that allow the upper layers to access detailed information directly, thus capturing minor defect features more effectively. Furthermore, LAFPN integrates coordinate data during the feature fusion phase, enhancing the detection of minor defects. We introduced a new loss function to identify and locate minor defects accurately. Extensive testing on two public datasets has demonstrated the superior performance of our model compared to five baseline models, achieving an impressive 80.8 mAP on the NEU-DET dataset and 72.6 mAP on the GC10-DET dataset.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}