Filter-deform attention GAN: constructing human motion videos from few images
Pub Date: 2024-08-26  DOI: 10.1007/s00371-024-03595-w
Jianjun Zhu, Huihuang Zhao, Yudong Zhang
Human motion transfer is challenging due to the complexity and diversity of human motion and clothing textures. Existing methods use 2D pose estimation to obtain poses, which can easily lead to unsmooth motion and artifacts. Therefore, this paper proposes a highly robust motion transfer model based on image deformation, called the Filter-Deform Attention Generative Adversarial Network (FDA GAN). This method can transfer complex human motion videos using only a few human images. First, we use a 3D pose and shape estimator instead of traditional 2D pose estimation to address the problem of unsmooth motion. Then, to tackle the artifact problem, we design a new attention mechanism and integrate it with the GAN, proposing a new network capable of effectively extracting image features and generating human motion videos. Finally, to further transfer the style of the source person, we propose a two-stream style loss, which enhances the model's learning ability. Experimental results demonstrate that the proposed method outperforms recent methods in overall performance and across various evaluation metrics. Project page: https://github.com/mioyeah/FDA-GAN.
{"title":"Filter-deform attention GAN: constructing human motion videos from few images","authors":"Jianjun Zhu, Huihuang Zhao, Yudong Zhang","doi":"10.1007/s00371-024-03595-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03595-w","url":null,"abstract":"<p>Human motion transfer is challenging due to the complexity and diversity of human motion and clothing textures. Existing methods use 2D pose estimation to obtain poses, which can easily lead to unsmooth motion and artifacts. Therefore, this paper proposes a highly robust motion transmission model based on image deformation, called the Filter-Deform Attention Generative Adversarial Network (FDA GAN). This method can transmit complex human motion videos using only few human images. First, we use a 3D pose shape estimator instead of traditional 2D pose estimation to address the problem of unsmooth motion. Then, to tackle the artifact problem, we design a new attention mechanism and integrate it with the GAN, proposing a new network capable of effectively extracting image features and generating human motion videos. Finally, to further transfer the style of the source human, we propose a two-stream style loss, which enhances the model’s learning ability. Experimental results demonstrate that the proposed method outperforms recent methods in overall performance and various evaluation metrics. Project page: https://github.com/mioyeah/FDA-GAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLDC: combining global and local consistency of multibranch depth completion
Pub Date: 2024-08-26  DOI: 10.1007/s00371-024-03609-7
Yaping Deng, Yingjiang Li, Zibo Wei, Keying Li
Depth completion aims to generate dense depth maps from sparse depth maps and corresponding RGB images. In this task, the locality of convolutional layers makes it difficult for the network to obtain global information. While Transformer-based architectures perform well in capturing global information, they may lose local detail features. Consequently, attending to global and local information simultaneously is crucial for effective depth completion. This paper proposes a novel and effective dual-encoder, three-decoder network consisting of local and global branches. Specifically, the local branch uses a convolutional network and the global branch uses a Transformer network to extract rich features. Meanwhile, the local branch is dominated by the color image and the global branch by the depth map, so that multimodal information is thoroughly integrated and utilized. In addition, a gate fusion mechanism is used in the decoder stage to fuse local and global information, achieving high-performance depth completion. This hybrid architecture is conducive to the effective fusion of local detail information and contextual information. Experimental results demonstrate the superiority of our method over other advanced methods on the KITTI Depth Completion and NYU v2 datasets.
{"title":"GLDC: combining global and local consistency of multibranch depth completion","authors":"Yaping Deng, Yingjiang Li, Zibo Wei, Keying Li","doi":"10.1007/s00371-024-03609-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03609-7","url":null,"abstract":"<p>Depth completion aims to generate dense depth maps from sparse depth maps and corresponding RGB images. In this task, the locality based on the convolutional layer poses challenges for the network in obtaining global information. While the Transformer-based architecture performs well in capturing global information, it may lead to the loss of local detail features. Consequently, improving the simultaneous attention to global and local information is crucial for achieving effective depth completion. This paper proposes a novel and effective dual-encoder–three-decoder network, consisting of local and global branches. Specifically, the local branch uses a convolutional network, and the global branch utilizes a Transformer network to extract rich features. Meanwhile, the local branch is dominated by color image and the global branch is dominated by depth map to thoroughly integrate and utilize multimodal information. In addition, a gate fusion mechanism is used in the decoder stage to fuse local and global information, to achieving high-performance depth completion. This hybrid architecture is conducive to the effective fusion of local detail information and contextual information. Experimental results demonstrated the superiority of our method over other advanced methods on KITTI Depth Completion and NYU v2 datasets.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ORHLR-Net: one-stage residual learning network for joint single-image specular highlight detection and removal
Pub Date: 2024-08-24  DOI: 10.1007/s00371-024-03607-9
Wenzhe Shi, Ziqi Hu, Hao Chen, Hengjia Zhang, Jiale Yang, Li Li
Detecting and removing specular highlights is a complex task whose solution can greatly enhance various downstream vision tasks in real-world environments. Although previous works have made great progress, they often ignore specular highlight areas or produce unsatisfactory results with visual artifacts such as color distortion. In this paper, we present a framework that utilizes an encoder–decoder structure for the joint task of specular highlight detection and removal in single images, employing specular highlight mask guidance. The encoder uses EfficientNet as a feature extraction backbone to convert the input RGB image into a series of feature maps. The decoder gradually restores these feature maps to their original size through up-sampling. In the specular highlight detection module, we enhance the network with residual modules to extract additional feature information, thereby improving detection accuracy. For the specular highlight removal module, we introduce the Convolutional Block Attention Module, which dynamically captures the importance of each channel and spatial location in the input feature map. This enables the model to effectively distinguish between foreground and background, resulting in enhanced adaptability and accuracy in complex scenes. We evaluate the proposed method on the publicly available SHIQ dataset, and a comparative analysis of the experimental results demonstrates its superiority. The source code will be available at https://github.com/hzq2333/ORHLR-Net.
{"title":"Orhlr-net: one-stage residual learning network for joint single-image specular highlight detection and removal","authors":"Wenzhe Shi, Ziqi Hu, Hao Chen, Hengjia Zhang, Jiale Yang, Li Li","doi":"10.1007/s00371-024-03607-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03607-9","url":null,"abstract":"<p>Detecting and removing specular highlights is a complex task that can greatly enhance various visual tasks in real-world environments. Although previous works have made great progress, they often ignore specular highlight areas or produce unsatisfactory results with visual artifacts such as color distortion. In this paper, we present a framework that utilizes an encoder–decoder structure for the combined task of specular highlight detection and removal in single images, employing specular highlight mask guidance. The encoder uses EfficientNet as a feature extraction backbone network to convert the input RGB image into a series of feature maps. The decoder gradually restores these feature maps to their original size through up-sampling. In the specular highlight detection module, we enhance the network by utilizing residual modules to extract additional feature information, thereby improving detection accuracy. For the specular highlight removal module, we introduce the Convolutional Block Attention Module, which dynamically captures the importance of each channel and spatial location in the input feature map. This enables the model to effectively distinguish between foreground and background, resulting in enhanced adaptability and accuracy in complex scenes. We evaluate the proposed method on the publicly available SHIQ dataset, and its superiority is demonstrated through a comparative analysis of the experimental results. The source code will be available at https://github.com/hzq2333/ORHLR-Net.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slot-VTON: subject-driven diffusion-based virtual try-on with slot attention
Pub Date: 2024-08-23  DOI: 10.1007/s00371-024-03603-z
Jianglei Ye, Yigang Wang, Fengmao Xie, Qin Wang, Xiaoling Gu, Zizhao Wu
Virtual try-on aims to transfer clothes from one image to another while preserving intricate wearer and clothing details. Tremendous efforts have been made to facilitate the task with deep generative models such as GANs and diffusion models; however, current methods do not take into account the influence of the natural environment (background and unrelated impurities) on the clothing image, leading to the loss of details such as intricate textures, shadows, and folds. In this paper, we introduce Slot-VTON, a slot attention-based inpainting approach for seamless, subject-driven image generation. Specifically, we adopt an attention mechanism, termed slot attention, that can separate the various subjects within an image without supervision. With slot attention, we distill the clothing image into a series of slot representations, where each slot represents a subject. Guided by the extracted clothing slot, our method can eliminate the interference of other unnecessary factors, thereby better preserving the complex details of the clothing. To further enhance the seamless generation of the diffusion model, we design a fusion adapter that integrates multiple conditions, including the slot and other added clothing conditions. In addition, a non-garment inpainting module is used to further fix visible seams and preserve non-clothing details (hands, neck, etc.). Multiple experiments on the VITON-HD dataset validate the efficacy of our method, showcasing state-of-the-art generation performance. Our implementation is available at: https://github.com/SilverLakee/Slot-VTON.
{"title":"Slot-VTON: subject-driven diffusion-based virtual try-on with slot attention","authors":"Jianglei Ye, Yigang Wang, Fengmao Xie, Qin Wang, Xiaoling Gu, Zizhao Wu","doi":"10.1007/s00371-024-03603-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03603-z","url":null,"abstract":"<p>Virtual try-on aims to transfer clothes from one image to another while preserving intricate wearer and clothing details. Tremendous efforts have been made to facilitate the task based on deep generative models such as GAN and diffusion models; however, the current methods have not taken into account the influence of the natural environment (background and unrelated impurities) on clothing image, leading to issues such as loss of detail, intricate textures, shadows, and folds. In this paper, we introduce Slot-VTON, a slot attention-based inpainting approach for seamless image generation in a subject-driven way. Specifically, we adopt an attention mechanism, termed slot attention, that can unsupervisedly separate the various subjects within images. With slot attention, we distill the clothing image into a series of slot representations, where each slot represents a subject. Guided by the extracted clothing slot, our method is capable of eliminating the interference of other unnecessary factors, thereby better preserving the complex details of the clothing. To further enhance the seamless generation of the diffusion model, we design a fusion adapter that integrates multiple conditions, including the slot and other added clothing conditions. In addition, a non-garment inpainting module is used to further fix visible seams and preserve non-clothing area details (hands, neck, etc.). Multiple experiments on VITON-HD datasets validate the efficacy of our methods, showcasing state-of-the-art generation performances. Our implementation is available at: https://github.com/SilverLakee/Slot-VTON.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning
Pub Date: 2024-08-23  DOI: 10.1007/s00371-024-03600-2
Gang Chen, Wenju Wang, Haoran Zhou, Xiaolin Wang
Representation learning on point cloud data that simultaneously captures local and global feature information is an urgent problem in high-precision 3D environment perception. However, current representation learning methods either focus only on efficiently learning local features, or capture long-distance dependencies at the cost of fine-grained features. Therefore, we explore transformers on the topological structure of point cloud graphs and propose an enhanced graph convolutional transformer (EGCT) method. EGCT constructs a graph topology for the disordered and unstructured point cloud. It then uses an enhanced point feature representation to further aggregate the feature information of all neighborhood points, compactly representing the features of the local neighborhood graph. Subsequently, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph, efficiently exploiting the spatial geometric information of point cloud objects. EGCT therefore learns comprehensive geometric information of point cloud objects, which helps to improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves an mIoU of 86.8% and OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset S3DIS, the OA of the EGCT method is 90.1% and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method achieves point cloud segmentation and classification performance comparable to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT.
{"title":"EGCT: enhanced graph convolutional transformer for 3D point cloud representation learning","authors":"Gang Chen, Wenju Wang, Haoran Zhou, Xiaolin Wang","doi":"10.1007/s00371-024-03600-2","DOIUrl":"https://doi.org/10.1007/s00371-024-03600-2","url":null,"abstract":"<p>It is an urgent problem of high-precision 3D environment perception to carry out representation learning on point cloud data, which complete the synchronous acquisition of local and global feature information. However, current representation learning methods either only focus on how to efficiently learn local features, or capture long-distance dependencies but lose the fine-grained features. Therefore, we explore transformer on topological structures of point cloud graphs, proposing an enhanced graph convolutional transformer (EGCT) method. EGCT construct graph topology for disordered and unstructured point cloud. Then it uses the enhanced point feature representation method to further aggregate the feature information of all neighborhood points, which can compactly represent the features of this local neighborhood graph. Subsequent process, the graph convolutional transformer simultaneously performs self-attention calculations and convolution operations on the point coordinates and features of the neighborhood graph. It efficiently utilizes the spatial geometric information of point cloud objects. Therefore, EGCT learns comprehensive geometric information of point cloud objects, which can help to improve segmentation and classification accuracy. On the ShapeNetPart and ModelNet40 datasets, our EGCT method achieves a mIoU of 86.8%, OA and AA of 93.5% and 91.2%, respectively. On the large-scale indoor scene point cloud dataset (S3DIS), the OA of EGCT method is 90.1%, and the mIoU is 67.8%. Experimental results demonstrate that our EGCT method can achieve comparable point cloud segmentation and classification performance to state-of-the-art methods while maintaining low model complexity. Our source code is available at https://github.com/shepherds001/EGCT.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MFPENet: multistage foreground-perception enhancement network for remote-sensing scene classification
Pub Date: 2024-08-13  DOI: 10.1007/s00371-024-03587-w
Junding Sun, Chenxu Wang, Haifeng Sima, Xiaosheng Wu, Shuihua Wang, Yudong Zhang
Scene classification plays a vital role in the field of remote sensing (RS). However, remote-sensing images exhibit complex scene information and large-scale spatial variation, as well as high similarity between different classes and significant differences within the same class, which poses great challenges to scene classification. To address these issues, a multistage foreground-perception enhancement network (MFPENet) is proposed to enhance the ability to perceive foreground features, thereby improving classification accuracy. Firstly, to enrich the scene semantics of the feature information, a multi-scale feature aggregation module is designed using dilated convolution, which takes the features of different stages of the backbone network as input to obtain enhanced multiscale features. Then, a novel foreground-perception enhancement module is designed to capture foreground information. Unlike previous methods, we separate foreground features by designing feature masks and then innovatively explore the symbiotic relationship between foreground features and scene features to further improve the recognition of foreground features. Finally, a hierarchical attention module is designed to reduce the interference of redundant background details on classification. By embedding the dependence between adjacent-level features into the attention mechanism, the model can pay more accurate attention to the key information; redundancy is reduced, and the loss of useful information is minimized. Experiments on three public RS scene classification datasets (UC-Merced, Aerial Image Dataset, and NWPU-RESISC45) show that our method achieves highly competitive results. Future work will focus on utilizing the background features outside the effective foreground features in an image as a decision aid to improve the distinguishability between similar scenes. The source code of our proposed algorithm and the related datasets are available at https://github.com/Hpu-wcx/MFPENet.
{"title":"Mfpenet: multistage foreground-perception enhancement network for remote-sensing scene classification","authors":"Junding Sun, Chenxu Wang, Haifeng Sima, Xiaosheng Wu, Shuihua Wang, Yudong Zhang","doi":"10.1007/s00371-024-03587-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03587-w","url":null,"abstract":"<p>Scene classification plays a vital role in the field of remote-sensing (RS). However, remote-sensing images have the essential properties of complex scene information and large-scale spatial changes, as well as the high similarity between various classes and the significant differences within the same class, which brings great challenges to scene classification. To address these issues, a multistage foreground-perception enhancement network (MFPENet) is proposed to enhance the ability to perceive foreground features, thereby improving classification accuracy. Firstly, to enrich the scene semantics of feature information, a multi-scale feature aggregation module is specifically designed using dilated convolution, which takes the features of different stages of the backbone network as input data to obtain enhanced multiscale features. Then, a novel foreground-perception enhancement module is designed to capture foreground information. Unlike the previous methods, we separate foreground features by designing feature masks and then innovatively explore the symbiotic relationship between foreground features and scene features to improve the recognition ability of foreground features further. Finally, a hierarchical attention module is designed to reduce the interference of redundant background details on classification. By embedding the dependence between adjacent level features into the attention mechanism, the model can pay more accurate attention to the key information. Redundancy is reduced, and the loss of useful information is minimized. Experiments on three public RS scene classification datasets [UC-Merced, Aerial Image Dataset, and NWPU-RESISC45] show that our method achieves highly competitive results. Future work will focus on utilizing the background features outside the effective foreground features in the image as a decision aid to improve the distinguishability between similar scenes. The source code of our proposed algorithm and the related datasets are available at https://github.com/Hpu-wcx/MFPENet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Directional latent space representation for medical image segmentation
Pub Date: 2024-08-12  DOI: 10.1007/s00371-024-03589-8
Xintao Liu, Yan Gao, Changqing Zhan, Qiao Wangr, Yu Zhang, Yi He, Hongyan Quan
Accurate medical image segmentation plays an important role in computer-aided diagnosis, and deep mining of pixel semantics is crucial for it. However, previous works on medical semantic segmentation usually overlook the importance of the embedding subspace and do not mine directional information in the latent space. In this work, we construct global orthogonal bases and channel orthogonal bases in the latent space, which significantly enhance the feature representation. We propose a novel distance-based segmentation method that decouples the embedding space into sub-embedding spaces of different classes and then performs pixel-level classification based on the distance between a pixel's embedding features and the origin of each subspace. Experiments on various public medical image segmentation benchmarks show that our model is superior to state-of-the-art methods. The code will be published at https://github.com/lxt0525/LSDENet.
{"title":"Directional latent space representation for medical image segmentation","authors":"Xintao Liu, Yan Gao, Changqing Zhan, Qiao Wangr, Yu Zhang, Yi He, Hongyan Quan","doi":"10.1007/s00371-024-03589-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03589-8","url":null,"abstract":"<p>Excellent medical image segmentation plays an important role in computer-aided diagnosis. Deep mining of pixel semantics is crucial for medical image segmentation. However, previous works on medical semantic segmentation usually overlook the importance of embedding subspace, and lacked the mining of latent space direction information. In this work, we construct global orthogonal bases and channel orthogonal bases in the latent space, which can significantly enhance the feature representation. We propose a novel distance-based segmentation method that decouples the embedding space into sub-embedding spaces of different classes, and then implements pixel level classification based on the distance between its embedding features and the origin of the subspace. Experiments on various public medical image segmentation benchmarks have shown that our model is superior compared to state-of-the-art methods. The code will be published at https://github.com/lxt0525/LSDENet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142224453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization
Pub Date: 2024-08-07  DOI: 10.1007/s00371-024-03290-w
Zhi Chen, Lijun Liu, Zhen Yu
Visual object tracking for unmanned aerial vehicles (UAVs) based on the discriminative correlation filter (DCF) has attracted extensive research and attention due to its computational efficiency and rapid progress, but it always suffers from unwanted boundary effects. To alleviate this problem, spatial-temporal regularized correlation filter frameworks have been proposed, which introduce a constant regularization term to penalize the coefficients of the DCF. Such trackers can substantially improve tracking performance but increase computational complexity. Moreover, these methods cannot adapt to specific appearance variations of the object, and much effort is needed to fine-tune the spatial-temporal regularization weight coefficients. In this work, an adaptive spatial-temporal weighted regularization (ASTWR) model is proposed. An ASTWR module is introduced to obtain the weighted spatial-temporal regularization coefficients automatically. The proposed ASTWR model deals effectively with complex situations and substantially improves the credibility of the tracking results. In addition, an adaptive spatial-temporal constraint adjusting mechanism is proposed: by suppressing drastic appearance changes between adjacent frames, the tracker enables smooth filter learning in the detection phase. Extensive experiments show that the proposed tracker performs favorably against comparable UAV-based and DCF-based trackers. Moreover, the ASTWR tracker runs at over 35 FPS on a single CPU platform and attains AUC scores of 57.9% and 49.7% on the UAV123 and VisDrone2020 datasets, respectively.
{"title":"Toward robust visual tracking for UAV with adaptive spatial-temporal weighted regularization","authors":"Zhi Chen, Lijun Liu, Zhen Yu","doi":"10.1007/s00371-024-03290-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03290-w","url":null,"abstract":"<p>The unmanned aerial vehicles (UAV) visual object tracking method based on the discriminative correlation filter (DCF) has gained extensive research and attention due to its superior computation and extraordinary progress, but is always suffers from unnecessary boundary effects. To solve the aforementioned problems, a spatial-temporal regularization correlation filter framework is proposed, which is achieved by introducing a constant regularization term to penalize the coefficients of the DCF filter. The tracker can substantially improve the tracking performance but increase computational complexity. However, these kinds of methods make the object fail to adapt to specific appearance variations, and we need to pay much effort in fine-tuning the spatial-temporal regularization weight coefficients. In this work, an adaptive spatial-temporal weighted regularization (ASTWR) model is proposed. An ASTWR module is introduced to obtain the weighted spatial-temporal regularization coefficients automatically. The proposed ASTWR model can deal effectively with complex situations and substantially improve the credibility of tracking results. In addition, an adaptive spatial-temporal constraint adjusting mechanism is proposed. By repressing the drastic appearance changes between adjacent frames, the tracker enables smooth filter learning in the detection phase. Substantial experiments show that the proposed tracker performs favorably against homogeneous UAV-based and DCF-based trackers. Moreover, the ASTWR tracker reaches over 35 FPS on a single CPU platform, and gains an AUC score of 57.9% and 49.7% on the UAV123 and VisDrone2020 datasets, respectively.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data visualization in healthcare and medicine: a survey
Pub Date: 2024-08-07  DOI: 10.1007/s00371-024-03586-x
Xunan Tan, Xiang Suo, Wenjun Li, Lei Bi, Fangshu Yao
Visualization analysis is crucial in healthcare, as it provides insights into complex data and helps healthcare professionals work more efficiently. Information visualization leverages algorithms to reduce the complexity of high-dimensional heterogeneous data, thereby enhancing healthcare professionals' understanding of the hidden associations within data structures. In the field of healthcare visualization, efforts have been made to refine and enhance the utility of data through diverse algorithms and visualization techniques. This review aims to summarize the existing research in this domain and identify future research directions. We searched the Web of Science, Google Scholar, and IEEE Xplore databases, and ultimately 76 articles were included in our analysis. We collected and synthesized the research findings from these articles, with a focus on visualization, artificial intelligence, and supporting tasks in healthcare. Our study revealed that researchers from diverse fields have employed a wide range of visualization techniques to visualize various types of data. We summarize these visualization methods and propose recommendations for future research. We anticipate that our findings will promote the further development and application of visualization techniques in healthcare.
{"title":"Data visualization in healthcare and medicine: a survey","authors":"Xunan Tan, Xiang Suo, Wenjun Li, Lei Bi, Fangshu Yao","doi":"10.1007/s00371-024-03586-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03586-x","url":null,"abstract":"<p>Visualization analysis is crucial in healthcare as it provides insights into complex data and aids healthcare professionals in efficiency. Information visualization leverages algorithms to reduce the complexity of high-dimensional heterogeneous data, thereby enhancing healthcare professionals’ understanding of the hidden associations among data structures. In the field of healthcare visualization, efforts have been made to refine and enhance the utility of data through diverse algorithms and visualization techniques. This review aims to summarize the existing research in this domain and identify future research directions. We searched Web of Science, Google Scholar and IEEE Xplore databases, and ultimately, 76 articles were included in our analysis. We collected and synthesized the research findings from these articles, with a focus on visualization, artificial intelligence and supporting tasks in healthcare. Our study revealed that researchers from diverse fields have employed a wide range of visualization techniques to visualize various types of data. We summarized these visualization methods and proposed recommendations for future research. We anticipate that our findings will promote further development and application of visualization techniques in healthcare.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"61 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient minor defects detection on steel surface via res-attention and position encoding
Pub Date: 2024-08-07  DOI: 10.1007/s00371-024-03583-0
Chuang Wu, Tingqin He
Impurities and complex manufacturing processes result in many small, dense defects on steel surfaces, which calls for precise defect detection models. Single-stage models (based on YOLO) are a popular choice, renowned for their computational efficiency and suitability for real-time online applications. However, existing YOLO-based models often fail to detect small features. To address this issue, we introduce an efficient steel surface defect detection model built on YOLOv7, incorporating a feature preservation block (FPB) and a location awareness feature pyramid network (LAFPN). The FPB uses shortcut connections that allow the upper layers to access detailed information directly, thus capturing minor defect features more effectively. Furthermore, the LAFPN integrates coordinate data during the feature fusion phase, enhancing the detection of minor defects. We also introduce a new loss function to identify and locate minor defects accurately. Extensive testing on two public datasets demonstrates the superior performance of our model compared to five baseline models, achieving 80.8 mAP on the NEU-DET dataset and 72.6 mAP on the GC10-DET dataset.
{"title":"Efficient minor defects detection on steel surface via res-attention and position encoding","authors":"Chuang Wu, Tingqin He","doi":"10.1007/s00371-024-03583-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03583-0","url":null,"abstract":"<p>Impurities and complex manufacturing processes result in many minor, dense steel defects. This situation requires precise defect detection models for effective protection. The single-stage model (based on YOLO) is a popular choice among current models, renowned for its computational efficiency and suitability for real-time online applications. However, existing YOLO-based models often fail to detect small features. To address this issue, we introduce an efficient steel surface defect detection model in YOLOv7, incorporating a feature preservation block (FPB) and location awareness feature pyramid network (LAFPN). The FPB uses shortcut connections that allow the upper layers to access detailed information directly, thus capturing minor defect features more effectively. Furthermore, LAFPN integrates coordinate data during the feature fusion phase, enhancing the detection of minor defects. We introduced a new loss function to identify and locate minor defects accurately. Extensive testing on two public datasets has demonstrated the superior performance of our model compared to five baseline models, achieving an impressive 80.8 mAP on the NEU-DET dataset and 72.6 mAP on the GC10-DET dataset.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141942704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}