Advanced deepfake detection with enhanced Resnet-18 and multilayer CNN max pooling
Pub Date: 2024-09-18  DOI: 10.1007/s00371-024-03613-x
Muhammad Fahad, Tao Zhang, Yasir Iqbal, Azaz Ikram, Fazeela Siddiqui, Bin Younas Abdullah, Malik Muhammad Nauman, Xin Zhao, Yanzhang Geng
Artificial intelligence has revolutionized technology, with generative adversarial networks (GANs) generating fake samples and deepfake videos. These technologies can cause panic and instability by allowing anyone to produce propaganda, so a robust system for distinguishing authentic from counterfeit information is crucial in the current social media era. This study offers an automated approach for categorizing deepfake videos using advanced machine learning and deep learning techniques. The processed videos are classified with a deep learning-based enhanced ResNet-18 combined with convolutional neural network (CNN) multilayer max pooling. This research contributes precise detection techniques for deepfake technology, which is gradually becoming a serious problem for digital media. The proposed enhanced ResNet-18 CNN method applies deep learning to GAN-based and other AI-generated videos to distinguish genuine from fake content. We fuse the sub-datasets (FaceSwap, Face2Face, Deepfakes, NeuralTextures) of FaceForensics, CelebDF, DeeperForensics, DeepFake Detection, and our own private dataset into one combined dataset containing 11,404 videos. The training data cover a diverse range of videos and sentiments, demonstrating the model's capability. The model is designed to identify videos with swapped faces as fake and videos without swaps as real. This work advances digital forensics by providing an effective response to deepfakes. The proposed model outperformed conventional methods in predicting video frames, with an accuracy of 99.99%, F-score of 99.98%, recall of 100%, and precision of 99.99%, confirming its effectiveness through comparative analysis. The source code is publicly available at https://doi.org/10.5281/zenodo.12538330.
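The abstract does not spell out how the ResNet-18 backbone and the multilayer max-pooling head fit together, so the following is a minimal sketch of one plausible layout, not the authors' exact architecture: a torchvision ResNet-18 feature extractor followed by stacked convolution + max-pooling stages and a binary real/fake classifier. All layer sizes and the class name are illustrative assumptions.

```python
# Sketch: ResNet-18 backbone + multilayer max-pooling head for real/fake frame classification.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class DeepfakeClassifier(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to (but not including) the average-pool / fc head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 512, H/32, W/32)
        # "Multilayer max pooling" head: stacked conv + max-pool stages (assumed sizes).
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.head(x).flatten(1)
        return self.classifier(x)


# Per-frame prediction on a dummy batch of face crops.
model = DeepfakeClassifier()
logits = model(torch.randn(4, 3, 224, 224))  # (4, 2) real/fake logits
```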
{"title":"Advanced deepfake detection with enhanced Resnet-18 and multilayer CNN max pooling","authors":"Muhammad Fahad, Tao Zhang, Yasir Iqbal, Azaz Ikram, Fazeela Siddiqui, Bin Younas Abdullah, Malik Muhammad Nauman, Xin Zhao, Yanzhang Geng","doi":"10.1007/s00371-024-03613-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03613-x","url":null,"abstract":"<p>Artificial intelligence has revolutionized technology, with generative adversarial networks (GANs) generating fake samples and deepfake videos. These technologies can lead to panic and instability, allowing anyone to produce propaganda. Therefore, it is crucial to develop a robust system to distinguish between authentic and counterfeit information in the current social media era. This study offers an automated approach for categorizing deepfake videos using advanced machine learning and deep learning techniques. The processed videos are classified using a deep learning-based enhanced Resnet-18 with convolutional neural network (CNN) multilayer max pooling. This research contributes to studying precise detection techniques for deepfake technology, which is gradually becoming a serious problem for digital media. The proposed enhanced Resnet-18 CNN method integrates deep learning algorithms on GAN architecture and artificial intelligence-generated videos to analyze and determine genuine and fake videos. In this research, we fuse the sub-datasets (faceswap, face2face, deepfakes, neural textures) of FaceForensics, CelebDF, DeeperForensics, DeepFake detection and our own created private dataset into one combined dataset, and the total number of videos are (11,404) in this fused dataset. The dataset on which it was trained has a diverse range of videos and sentiments, demonstrating its capability. The structure of the model is designed to predict and identify videos with faces accurately switched as fakes, while those without switches are real. This paper is a great leap forward in the area of digital forensics, providing an excellent response to deepfakes. The proposed model outperformed conventional methods in predicting video frames, with an accuracy score of 99.99%, F-score of 99.98%, recall of 100%, and precision of 99.99%, confirming its effectiveness through a comparative analysis. The source code of this study is available publically at https://doi.org/10.5281/zenodo.12538330.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video-driven musical composition using large language model with memory-augmented state space
Pub Date: 2024-09-18  DOI: 10.1007/s00371-024-03606-w
Wan-He Kai, Kai-Xin Xing
The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, and videos. However, research on LLMs for music inspiration is still in its infancy. To fill this gap and break through the limitation that LLMs can only understand short videos with limited frames, we propose a large language model with state space for long-term video-to-music generation. To capture long-range dependencies and maintain high performance while further decreasing the computing cost, our overall network includes the Enhanced Video Mamba, which incorporates continuous moving window partitioning and local feature augmentation, and a long-term memory bank that captures and aggregates historical video information to mitigate information loss in long sequences. This framework achieves both subquadratic-time computation and near-linear memory complexity, enabling effective long-term video-to-music generation. We conduct a thorough evaluation of the proposed framework; the experimental results demonstrate that our model matches or surpasses the performance of current state-of-the-art models. Our code is released at https://github.com/kai211233/S2L2-V2M.
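To make the two mechanisms named here concrete, the sketch below shows moving-window partitioning of a long frame-feature sequence plus a bounded memory bank that aggregates per-window summaries. It is not the authors' implementation: the state-space (Mamba) block is stubbed with a GRU purely so the sketch runs, and window, stride, and memory sizes are assumptions.

```python
# Sketch: overlapping moving windows + long-term memory bank over per-frame features.
import torch
import torch.nn as nn


class WindowedEncoderWithMemory(nn.Module):  # hypothetical name
    def __init__(self, dim=256, window=16, stride=8, memory_slots=32):
        super().__init__()
        self.window, self.stride = window, stride
        self.block = nn.GRU(dim, dim, batch_first=True)  # stand-in for the state-space block
        self.memory = []                                  # list of (B, dim) window summaries
        self.memory_slots = memory_slots

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, dim) per-frame features of a long video."""
        outputs = []
        T = feats.shape[1]
        for start in range(0, max(T - self.window, 0) + 1, self.stride):
            win = feats[:, start:start + self.window]          # overlapping moving window
            enc, _ = self.block(win)                            # (B, window, dim)
            summary = enc.mean(dim=1)                           # one token per window
            self.memory.append(summary.detach())
            self.memory = self.memory[-self.memory_slots:]      # bounded memory bank
            history = torch.stack(self.memory, dim=1).mean(1)   # aggregate historical info
            outputs.append(enc + history.unsqueeze(1))          # inject long-term context
        return torch.cat(outputs, dim=1)


enc = WindowedEncoderWithMemory()
out = enc(torch.randn(2, 128, 256))  # 128-frame sequence processed window by window
```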
{"title":"Video-driven musical composition using large language model with memory-augmented state space","authors":"Wan-He Kai, Kai-Xin Xing","doi":"10.1007/s00371-024-03606-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03606-w","url":null,"abstract":"<p>The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. However, the research work on LLms for music inspiration is still in its infancy. To fill the gap in this field and break through the dilemma that LLMs can only understand short videos with limited frames, we propose a large language model with state space for long-term video-to-music generation. To capture long-range dependency and maintaining high performance, while further decrease the computing cost, our overall network includes the Enhanced Video Mamba, which incorporates continuous moving window partitioning and local feature augmentation, and a long-term memory bank that captures and aggregates historical video information to mitigate information loss in long sequences. This framework achieves both subquadratic-time computation and near-linear memory complexity, enabling effective long-term video-to-music generation. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models. Our code released on https://github.com/kai211233/S2L2-V2M.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D human pose estimation using spatiotemporal hypergraphs and its public benchmark on opera videos
Pub Date: 2024-09-17  DOI: 10.1007/s00371-024-03604-y
Xingquan Cai, Haoyu Zhang, LiZhe Chen, YiJie Wu, Haiyan Sun
Graph convolutional networks significantly improve 3D human pose estimation accuracy by representing the human skeleton as an undirected spatiotemporal graph. However, this representation fails to reflect the cross-connection interactions of multiple joints, and current 3D human pose estimation methods have larger errors on opera videos because of the occlusion caused by clothing and movements. In this paper, we propose a 3D human pose estimation method based on spatiotemporal hypergraphs for opera videos. First, the 2D human pose sequence of the opera performer is taken as input, and, based on the interaction information between multiple joints in the opera action, multiple spatiotemporal hypergraphs representing the spatial correlation and temporal continuity of the joints are generated. Then, a hypergraph convolution network is constructed from these joint spatiotemporal hypergraphs to extract spatiotemporal features from the 2D pose sequence. Finally, a multi-hypergraph cross-attention mechanism is introduced to strengthen the correlation between spatiotemporal hypergraphs and predict 3D human poses. Experiments show that our method achieves the best performance on the Human3.6M and MPI-INF-3DHP datasets compared to graph convolutional network and Transformer-based methods. In addition, ablation experiments show that the generated spatiotemporal hypergraphs effectively improve accuracy compared to the undirected spatiotemporal graph. The experiments demonstrate that the method obtains accurate 3D human poses in the presence of clothing and limb occlusion in opera videos. Code will be available at https://github.com/zhanghaoyu0408/hyperAzzy.
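For readers unfamiliar with hypergraph convolution, the sketch below shows one layer operating on 2D joint features, following the standard HGNN propagation rule X' = Dv^(-1/2) H De^(-1) H^T Dv^(-1/2) X Theta (hyperedge weights taken as identity). The incidence matrix, class name, and feature sizes are toy assumptions, not the paper's construction.

```python
# Sketch: one hypergraph convolution layer over joint features, hyperedges group several joints.
import torch
import torch.nn as nn


class HypergraphConv(nn.Module):  # assumed layer, not the paper's code
    def __init__(self, in_dim, out_dim, incidence: torch.Tensor):
        super().__init__()
        H = incidence.float()                        # (num_joints, num_hyperedges)
        Dv = H.sum(1).clamp(min=1).pow(-0.5).diag()  # vertex degree ^ -1/2
        De = H.sum(0).clamp(min=1).pow(-1.0).diag()  # hyperedge degree ^ -1
        self.register_buffer("prop", Dv @ H @ De @ H.t() @ Dv)
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, num_joints, in_dim) 2D pose features."""
        return torch.relu(self.prop @ self.theta(x))


# Toy incidence: 5 joints, 2 hyperedges (joints 0-2 form one group, 2-4 the other).
H = torch.tensor([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]])
layer = HypergraphConv(2, 64, H)
feats = layer(torch.randn(8, 5, 2))  # (8, 5, 64)
```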
{"title":"3D human pose estimation using spatiotemporal hypergraphs and its public benchmark on opera videos","authors":"Xingquan Cai, Haoyu Zhang, LiZhe Chen, YiJie Wu, Haiyan Sun","doi":"10.1007/s00371-024-03604-y","DOIUrl":"https://doi.org/10.1007/s00371-024-03604-y","url":null,"abstract":"<p>Graph convolutional networks significantly improve the 3D human pose estimation accuracy by representing the human skeleton as an undirected spatiotemporal graph. However, this representation fails to reflect the cross-connection interactions of multiple joints, and the current 3D human pose estimation methods have larger errors in opera videos due to the occlusion of clothing and movements in opera videos. In this paper, we propose a 3D human pose estimation method based on spatiotemporal hypergraphs for opera videos. <i>First, the 2D human pose sequence of the opera video performer is inputted, and based on the interaction information between multiple joints in the opera action, multiple spatiotemporal hypergraphs representing the spatial correlation and temporal continuity of the joints are generated. Then, a hypergraph convolution network is constructed using the joints spatiotemporal hypergraphs to extract the spatiotemporal features in the 2D human poses sequence. Finally, a multi-hypergraph cross-attention mechanism is introduced to strengthen the correlation between spatiotemporal hypergraphs and predict 3D human poses</i>. Experiments show that our method achieves the best performance on the Human3.6M and MPI-INF-3DHP datasets compared to the graph convolutional network and Transformer-based methods. In addition, ablation experiments show that the multiple spatiotemporal hypergraphs we generate can effectively improve the network accuracy compared to the undirected spatiotemporal graph. The experiments demonstrate that the method can obtain accurate 3D human poses in the presence of clothing and limb occlusion in opera videos. Codes will be available at: https://github.com/zhanghaoyu0408/hyperAzzy.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lunet: an enhanced upsampling fusion network with efficient self-attention for semantic segmentation
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03590-1
Yan Zhou, Haibin Zhou, Yin Yang, Jianxun Li, Richard Irampaye, Dongli Wang, Zhengpeng Zhang
Semantic segmentation is an essential aspect of many computer vision tasks. Self-attention (SA)-based deep learning methods have shown impressive results in semantic segmentation by capturing long-range dependencies and contextual information. However, the standard SA module has high computational complexity, which limits its use in resource-constrained scenarios. This paper proposes a novel LUNet to improve semantic segmentation performance while addressing the computational challenges of SA. The lightweight self-attention plus (LSA++) module is introduced as a lightweight and efficient variant of the SA module. LSA++ uses compact feature representation and local position embedding to significantly reduce computational complexity while surpassing the accuracy of the standard SA module. Furthermore, to address the loss of edge details during decoding, we propose the enhanced upsampling fusion module (EUP-FM). This module comprises an enhanced upsampling module and a semantic vector-guided fusion mechanism. EUP-FM effectively recovers edge information and improves the precision of the segmentation map. Comprehensive experiments on PASCAL VOC 2012, Cityscapes, COCO, and SegPC 2021 demonstrate that LUNet outperforms all compared methods. It achieves superior runtime performance and accurate segmentation with excellent model generalization ability. The code is available at https://github.com/hbzhou530/LUNet.
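The abstract describes LSA++ only at a high level, so the following is a minimal sketch, under assumed design choices, of a generic lightweight self-attention block: keys and values are spatially reduced to cut the quadratic cost, and a depthwise convolution provides a local position embedding. It illustrates the general idea behind such modules rather than the exact LSA++ implementation; all sizes and names are assumptions.

```python
# Sketch: self-attention with reduced K/V tokens and a depthwise-conv position embedding.
import torch
import torch.nn as nn


class LightSelfAttention(nn.Module):  # hypothetical module
    def __init__(self, dim=64, heads=4, reduction=4):
        super().__init__()
        self.pos = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)        # local position embedding
        self.reduce = nn.Conv2d(dim, dim, reduction, stride=reduction)  # compact K/V representation
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, H, W) feature map."""
        x = x + self.pos(x)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        kv = self.reduce(x).flatten(2).transpose(1, 2)   # (B, H*W / r^2, C) compact keys/values
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W) + x


blk = LightSelfAttention()
y = blk(torch.randn(2, 64, 32, 32))  # same output shape, ~16x fewer K/V tokens
```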
{"title":"Lunet: an enhanced upsampling fusion network with efficient self-attention for semantic segmentation","authors":"Yan Zhou, Haibin Zhou, Yin Yang, Jianxun Li, Richard Irampaye, Dongli Wang, Zhengpeng Zhang","doi":"10.1007/s00371-024-03590-1","DOIUrl":"https://doi.org/10.1007/s00371-024-03590-1","url":null,"abstract":"<p>Semantic segmentation is an essential aspect of many computer vision tasks. Self-attention (SA)-based deep learning methods have shown impressive results in semantic segmentation by capturing long-range dependencies and contextual information. However, the standard SA module has high computational complexity, which limits its use in resource-constrained scenarios. This paper proposes a novel LUNet to improve semantic segmentation performance while addressing the computational challenges of SA. The lightweight self-attention plus (LSA++) module is introduced as a lightweight and efficient variant of the SA module. LSA++ uses compact feature representation and local position embedding to significantly reduce computational complexity while surpassing the accuracy of the standard SA module. Furthermore, to address the loss of edge details during decoding, we propose the enhanced upsampling fusion module (EUP-FM). This module comprises an enhanced upsampling module and a semantic vector-guided fusion mechanism. EUP-FM effectively recovers edge information and improves the precision of the segmentation map. Comprehensive experiments on PASCAL VOC 2012, Cityscapes, COCO, and SegPC 2021 demonstrate that LUNet outperforms all compared methods. It achieves superior runtime performance and accurate segmentation with excellent model generalization ability. The code is available at https://github.com/hbzhou530/LUNet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FDDCC-VSR: a lightweight video super-resolution network based on deformable 3D convolution and cheap convolution
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03621-x
Xiaohu Wang, Xin Yang, Hengrui Li, Tao Li
Currently, mainstream deep video super-resolution (VSR) models typically employ deeper neural network layers or larger receptive fields. This approach increases computational requirements, making network training difficult and inefficient. Therefore, this paper proposes a VSR model called fusion of deformable 3D convolution and cheap convolution (FDDCC-VSR). In FDDCC-VSR, we first divide the detailed features of each frame into dynamic features of visually moving objects and details of static backgrounds. This division allows fewer specialized convolutions to be used in feature extraction, resulting in a lightweight network that is easier to train. Furthermore, FDDCC-VSR incorporates multiple D-C CRBs (convolutional residual blocks), which establish a lightweight spatial attention mechanism to aid the deformable 3D convolution, enabling the model to focus on learning the corresponding feature details. Finally, we employ improved bicubic interpolation combined with subpixel techniques to enhance the PSNR (peak signal-to-noise ratio) of the reconstructed frames. Detailed experiments demonstrate that FDDCC-VSR outperforms state-of-the-art algorithms in terms of both subjective visual quality and objective evaluation criteria, while incurring only a small parameter and computation overhead.
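As a point of reference for the "cheap convolution" half of the design, the sketch below shows a GhostNet-style block: a standard convolution produces half of the output channels and inexpensive depthwise convolutions synthesize the rest, roughly halving the FLOPs. This is an assumed illustration of the general technique, not the FDDCC-VSR block itself.

```python
# Sketch: cheap ("ghost"-style) convolution block.
import torch
import torch.nn as nn


class CheapConv(nn.Module):  # hypothetical block
    def __init__(self, in_ch, out_ch):
        super().__init__()
        primary = out_ch // 2
        self.primary = nn.Sequential(          # standard conv produces half the channels
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(             # depthwise convs synthesize the rest
            nn.Conv2d(primary, out_ch - primary, 3, padding=1, groups=primary, bias=False),
            nn.BatchNorm2d(out_ch - primary), nn.ReLU(inplace=True))

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)


x = torch.randn(1, 32, 64, 64)
print(CheapConv(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```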
{"title":"FDDCC-VSR: a lightweight video super-resolution network based on deformable 3D convolution and cheap convolution","authors":"Xiaohu Wang, Xin Yang, Hengrui Li, Tao Li","doi":"10.1007/s00371-024-03621-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03621-x","url":null,"abstract":"<p>Currently, the mainstream deep video super-resolution (VSR) models typically employ deeper neural network layers or larger receptive fields. This approach increases computational requirements, making network training difficult and inefficient. Therefore, this paper proposes a VSR model called fusion of deformable 3D convolution and cheap convolution (FDDCC-VSR).In FDDCC-VSR, we first divide the detailed features of each frame in VSR into dynamic features of visual moving objects and details of static backgrounds. This division allows for the use of fewer specialized convolutions in feature extraction, resulting in a lightweight network that is easier to train. Furthermore, FDDCC-VSR incorporates multiple D-C CRBs (Convolutional Residual Blocks), which establish a lightweight spatial attention mechanism to aid deformable 3D convolution. This enables the model to focus on learning the corresponding feature details. Finally, we employ an improved bicubic interpolation combined with subpixel techniques to enhance the PSNR (Peak Signal-to-Noise Ratio) value of the original image. Detailed experiments demonstrate that FDDCC-VSR outperforms the most advanced algorithms in terms of both subjective visual effects and objective evaluation criteria. Additionally, our model exhibits a small parameter and calculation overhead.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topological structure extraction for computing surface–surface intersection curves
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03616-8
Pengbo Bo, Qingxiang Liu, Caiming Zhang
Surface–surface intersection curve computation is a fundamental problem in CAD and solid modeling. Extracting the structure of intersection curves accurately, especially when there are multiple overlapping curves, is a key challenge. Existing methods rely on densely sampled intersection points and proximity-based connections, which are time-consuming to obtain. In this paper, we propose a novel method based on Delaunay triangulation to accurately extract intersection curves, even with sparse intersection points. We also introduce an intersection curve optimization technique to enhance curve accuracy. Extensive experiments on various examples demonstrate the effectiveness of our method.
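The sketch below illustrates the basic idea of recovering curve structure from sparse samples with a Delaunay triangulation: triangulate the intersection points and keep only short edges, so chains of kept edges trace the curve branches. It is not the authors' algorithm; the median-based pruning threshold and the 2D parameter-domain setting are assumptions.

```python
# Sketch: connect sparse intersection points via Delaunay triangulation and edge pruning.
import numpy as np
from scipy.spatial import Delaunay


def curve_edges(points: np.ndarray, factor: float = 2.0):
    """points: (N, 2) intersection samples (e.g. in one surface's parameter domain)."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:               # each simplex is a triangle (i, j, k)
        for a, b in [(0, 1), (1, 2), (0, 2)]:
            i, j = sorted((simplex[a], simplex[b]))
            edges.add((i, j))
    edges = sorted(edges)
    lengths = np.array([np.linalg.norm(points[i] - points[j]) for i, j in edges])
    keep = lengths < factor * np.median(lengths)    # drop long edges bridging separate branches
    return [e for e, k in zip(edges, keep) if k]


# Two sparsely sampled curve branches.
t = np.linspace(0, 1, 25)
pts = np.vstack([np.c_[t, np.sin(3 * t)], np.c_[t, 1.5 + 0.3 * t]])
print(len(curve_edges(pts)), "edges kept")
```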
{"title":"Topological structure extraction for computing surface–surface intersection curves","authors":"Pengbo Bo, Qingxiang Liu, Caiming Zhang","doi":"10.1007/s00371-024-03616-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03616-8","url":null,"abstract":"<p>Surface–surface intersection curve computation is a fundamental problem in CAD and solid modeling. Extracting the structure of intersection curves accurately, especially when there are multiple overlapping curves, is a key challenge. Existing methods rely on densely sampled intersection points and proximity-based connections, which are time-consuming to obtain. In this paper, we propose a novel method based on Delaunay triangulation to accurately extract intersection curves, even with sparse intersection points. We also introduce an intersection curve optimization technique to enhance curve accuracy. Extensive experiments on various examples demonstrate the effectiveness of our method.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03611-z
Sunhan Xu, Jinhua Wang, Ning He, Guangmei Xu, Geng Zhang
Underwater image enhancement is critical for advancing marine science and underwater engineering. Traditional methods often struggle with color distortion, low contrast, and blurred details due to the challenging underwater environment. Addressing these issues, we introduce a semi-supervised underwater image enhancement framework, Semi-UIE, which leverages unlabeled data alongside limited labeled data to significantly enhance generalization capabilities. This framework integrates a novel aggregated attention within a UNet architecture, utilizing multi-scale convolutional kernels for efficient feature aggregation. This approach not only improves the sharpness and authenticity of underwater visuals but also ensures substantial computational efficiency. Importantly, Semi-UIE excels in capturing both macro- and micro-level details, effectively addressing common issues of over-correction and detail loss. Our experimental results demonstrate a marked improvement in performance on several public datasets, including UIEBD and EUVP, with notable enhancements in image quality metrics compared to existing methods. The robustness of our model across diverse underwater environments is confirmed by its superior performance on unlabeled datasets. Our code and pre-trained models are available at https://github.com/Sunhan-Ash/Semi-UIE.
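The "aggregated attention with multi-scale convolutional kernels" is described only at a high level, so the following is a minimal sketch of one plausible reading: parallel depthwise convolutions with different kernel sizes are summed and re-weighted by a channel gate. The layout, channel counts, and class name are assumptions, not the Semi-UIE module.

```python
# Sketch: multi-scale kernel aggregation followed by a channel-attention gate.
import torch
import torch.nn as nn


class MultiScaleAggAttention(nn.Module):  # hypothetical module
    def __init__(self, ch=64, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch) for k in kernels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())

    def forward(self, x):
        fused = sum(branch(x) for branch in self.branches)   # aggregate the scales
        return x + fused * self.gate(fused)                  # channel-attended residual


blk = MultiScaleAggAttention()
y = blk(torch.randn(1, 64, 128, 128))  # same spatial size, enhanced features
```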
{"title":"Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention","authors":"Sunhan Xu, Jinhua Wang, Ning He, Guangmei Xu, Geng Zhang","doi":"10.1007/s00371-024-03611-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03611-z","url":null,"abstract":"<p>Underwater image enhancement is critical for advancing marine science and underwater engineering. Traditional methods often struggle with color distortion, low contrast, and blurred details due to the challenging underwater environment. Addressing these issues, we introduce a semi-supervised underwater image enhancement framework, Semi-UIE, which leverages unlabeled data alongside limited labeled data to significantly enhance generalization capabilities. This framework integrates a novel aggregated attention within a UNet architecture, utilizing multi-scale convolutional kernels for efficient feature aggregation. This approach not only improves the sharpness and authenticity of underwater visuals but also ensures substantial computational efficiency. Importantly, Semi-UIE excels in capturing both macro- and micro-level details, effectively addressing common issues of over-correction and detail loss. Our experimental results demonstrate a marked improvement in performance on several public datasets, including UIEBD and EUVP, with notable enhancements in image quality metrics compared to existing methods. The robustness of our model across diverse underwater environments is confirmed by its superior performance on unlabeled datasets. Our code and pre-trained models are available at https://github.com/Sunhan-Ash/Semi-UIE.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FFCANet: a frequency channel fusion coordinate attention mechanism network for lane detection
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03626-6
Shijie Li, Shanhua Yao, Zhonggen Wang, Juan Wu
Lane line detection is a challenging task in complex and dynamic driving scenarios. To address the limitations of existing lane line detection algorithms, which struggle to balance accuracy and efficiency in complex and changing traffic scenes, a frequency channel fusion coordinate attention mechanism network (FFCANet) for lane detection is proposed. A residual neural network (ResNet) is used as the feature extraction backbone. We propose a feature enhancement method with a frequency channel fusion coordinate attention mechanism (FFCA) that captures feature information from different spatial orientations and then uses multiple frequency components to extract detail and texture features of lane lines. A row-anchor-based prediction and classification method treats lane line detection as the problem of selecting lane marking anchors within row-oriented cells predefined by global features, which greatly improves detection speed and can handle scenarios with no visual clues. Additionally, an efficient channel attention (ECA) module is integrated into the auxiliary segmentation branch to capture dynamic dependencies between channels, further enhancing feature extraction. The performance of the model is evaluated on two publicly available datasets, TuSimple and CULane. Results show an average processing time of 5.0 ms per frame, an accuracy of 96.09% on TuSimple, and an F1 score of 72.8% on CULane. The model is robust in complex scenes while effectively balancing detection accuracy and speed. The source code is available at https://github.com/lsj1012/FFCANet/tree/master
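To make the row-anchor formulation concrete, the sketch below shows a head that, for each predefined row and each lane, classifies which of a fixed number of column cells contains the lane marking (an extra class means "no lane in this row"). The feature dimension, row/cell counts, and class name are assumed values for illustration, not FFCANet's exact head.

```python
# Sketch: row-anchor classification head over pooled global features.
import torch
import torch.nn as nn


class RowAnchorHead(nn.Module):  # hypothetical head
    def __init__(self, feat_dim=512, rows=56, grid_cols=100, lanes=4):
        super().__init__()
        self.rows, self.cols, self.lanes = rows, grid_cols + 1, lanes  # +1 = "no lane" class
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, rows * self.cols * lanes))

    def forward(self, global_feat: torch.Tensor) -> torch.Tensor:
        """global_feat: (B, feat_dim) pooled backbone features."""
        out = self.fc(global_feat)
        return out.view(-1, self.cols, self.rows, self.lanes)  # logits per (cell, row, lane)


head = RowAnchorHead()
logits = head(torch.randn(2, 512))
cells = logits.argmax(dim=1)  # (2, 56, 4): chosen column cell per row and lane
```

Because the head only selects one cell per row and lane from a global feature vector, the per-frame cost is a couple of fully connected layers rather than dense per-pixel segmentation, which is what makes this formulation fast.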
{"title":"FFCANet: a frequency channel fusion coordinate attention mechanism network for lane detection","authors":"Shijie Li, Shanhua Yao, Zhonggen Wang, Juan Wu","doi":"10.1007/s00371-024-03626-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03626-6","url":null,"abstract":"<p>Lane line detection becomes a challenging task in complex and dynamic driving scenarios. Addressing the limitations of existing lane line detection algorithms, which struggle to balance accuracy and efficiency in complex and changing traffic scenarios, a frequency channel fusion coordinate attention mechanism network (FFCANet) for lane detection is proposed. A residual neural network (ResNet) is used as a feature extraction backbone network. We propose a feature enhancement method with a frequency channel fusion coordinate attention mechanism (FFCA) that captures feature information from different spatial orientations and then uses multiple frequency components to extract detail and texture features of lane lines. A row-anchor-based prediction and classification method treats lane line detection as a problem of selecting lane marking anchors within row-oriented cells predefined by global features, which greatly improves the detection speed and can handle visionless driving scenarios. Additionally, an efficient channel attention (ECA) module is integrated into the auxiliary segmentation branch to capture dynamic dependencies between channels, further enhancing feature extraction capabilities. The performance of the model is evaluated on two publicly available datasets, TuSimple and CULane. Simulation results demonstrate that the average processing time per image frame is 5.0 ms, with an accuracy of 96.09% on the TuSimple dataset and an F1 score of 72.8% on the CULane dataset. The model exhibits excellent robustness in detecting complex scenes while effectively balancing detection accuracy and speed. The source code is available at https://github.com/lsj1012/FFCANet/tree/master</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-guided floral image generation based on lightweight deep attention feature fusion GAN
Pub Date: 2024-09-14  DOI: 10.1007/s00371-024-03617-7
Wenji Yang, Hang An, Wenchao Hu, Xinxin Ma, Liping Xie
Generating floral images conditioned on textual descriptions is a highly challenging task. Most existing text-to-floral image synthesis methods adopt a single-stage generation architecture, which often requires substantial hardware resources, such as large-scale GPU clusters and a large number of training images. Moreover, this architecture tends to lose detail when shallow image features are fused with deep image features. To address these challenges, this paper proposes a Lightweight Deep Attention Feature Fusion Generative Adversarial Network for the text-to-floral image generation task, which performs well even with limited hardware resources. First, we introduce a novel Deep Attention Text-Image Fusion Block that leverages Multi-scale Channel Attention Mechanisms to effectively enhance detail rendering and visual consistency in text-generated floral images. Second, we propose a novel Self-Supervised Target-Aware Discriminator capable of learning a richer feature mapping coverage area from input images, which not only helps the generator create higher-quality images but also improves the training efficiency of GANs, further reducing resource consumption. Finally, extensive experiments on datasets of three different sample sizes validate the effectiveness of the proposed model. Source code and pretrained models are available at https://github.com/BoomAnm/LDAF-GAN.
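The abstract does not detail how the text embedding and image features are fused, so the following is a minimal sketch of one common pattern: the sentence embedding is projected to per-channel scale and shift parameters that modulate the generator features, followed by a simple channel gate (a single-scale stand-in for the paper's multi-scale channel attention). All names, dimensions, and the overall layout are assumptions, not the paper's block.

```python
# Sketch: text-conditioned affine modulation of image features plus a channel gate.
import torch
import torch.nn as nn


class TextImageFusion(nn.Module):  # hypothetical block
    def __init__(self, ch=128, text_dim=256):
        super().__init__()
        self.affine = nn.Linear(text_dim, ch * 2)          # per-channel gamma, beta from text
        self.local = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, img: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        """img: (B, ch, H, W) generator features; text: (B, text_dim) sentence embedding."""
        gamma, beta = self.affine(text).chunk(2, dim=1)
        x = img * (1 + gamma[..., None, None]) + beta[..., None, None]
        x = x + self.local(x)                              # local spatial context
        return x * self.gate(x)                            # channel re-weighting


fuse = TextImageFusion()
out = fuse(torch.randn(2, 128, 16, 16), torch.randn(2, 256))  # (2, 128, 16, 16)
```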
{"title":"Text-guided floral image generation based on lightweight deep attention feature fusion GAN","authors":"Wenji Yang, Hang An, Wenchao Hu, Xinxin Ma, Liping Xie","doi":"10.1007/s00371-024-03617-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03617-7","url":null,"abstract":"<p>Generating floral images conditioned on textual descriptions is a highly challenging task. However, most existing text-to-floral image synthesis methods adopt a single-stage generation architecture, which often requires substantial hardware resources, such as large-scale GPU clusters and a large number of training images. Moreover, this architecture tends to lose some detail features when shallow image features are fused with deep image features. To address these challenges, this paper proposes a Lightweight Deep Attention Feature Fusion Generative Adversarial Network for the text-to-floral image generation task. This network performs impressively well even with limited hardware resources. Specifically, we introduce a novel Deep Attention Text-Image Fusion Block that leverages Multi-scale Channel Attention Mechanisms to effectively enhance the capability of displaying details and visual consistency in text-generated floral images. Secondly, we propose a novel Self-Supervised Target-Aware Discriminator capable of learning a richer feature mapping coverage area from input images. This not only aids the generator in creating higher-quality images but also improves the training efficiency of GANs, further reducing resource consumption. Finally, extensive experiments on dataset of three different sample sizes validate the effectiveness of the proposed model. Source code and pretrained models are available at https://github.com/BoomAnm/LDAF-GAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Research on a small target object detection method for aerial photography based on improved YOLOv7
Pub Date: 2024-09-14  DOI: 10.1007/s00371-024-03615-9
Jiajun Yang, Xuesong Zhang, Cunli Song
In aerial imagery analysis, detecting small targets is highly challenging due to their minimal pixel representation and complex backgrounds. To address this issue, this manuscript proposes a novel method for detecting small aerial targets. First, the K-means++ algorithm is utilized to generate anchor boxes suitable for small targets. Second, the YOLOv7-BFAW model is proposed. This method incorporates a series of improvements to YOLOv7, including the integration of a BBF residual structure based on BiFormer and BottleNeck fusion into the backbone network, the design of an MPsim module based on simAM attention for the head network, and the development of a novel localization loss function, inner-WIoU v2, based on WIoU v2. Experiments demonstrate that YOLOv7-BFAW achieves a 4.2% mAP@.5 improvement on the DOTA v1.0 dataset and a 1.7% mAP@.5 improvement on the VisDrone2019 dataset, showcasing excellent generalization. Furthermore, YOLOv7-BFAW exhibits superior detection performance compared to state-of-the-art algorithms.
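The K-means++ anchor-generation step can be illustrated directly: cluster the ground-truth box widths and heights and use the cluster centers as anchors. The sketch below uses scikit-learn's K-means with k-means++ initialization and Euclidean distance; the random box data and anchor count are illustrative, and the paper may use a different distance or preprocessing.

```python
# Sketch: generate anchor boxes by clustering ground-truth (width, height) pairs with K-means++.
import numpy as np
from sklearn.cluster import KMeans


def anchor_boxes(wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """wh: (N, 2) ground-truth box (width, height) pairs, e.g. normalized to [0, 1]."""
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]   # sort by area, small to large


wh = np.abs(np.random.randn(500, 2)) * 0.1 + 0.05      # synthetic small-object boxes
print(anchor_boxes(wh).round(3))
```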
{"title":"Research on a small target object detection method for aerial photography based on improved YOLOv7","authors":"Jiajun Yang, Xuesong Zhang, Cunli Song","doi":"10.1007/s00371-024-03615-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03615-9","url":null,"abstract":"<p>In aerial imagery analysis, detecting small targets is highly challenging due to their minimal pixel representation and complex backgrounds. To address this issue, this manuscript proposes a novel method for detecting small aerial targets. Firstly, the K-means + + algorithm is utilized to generate anchor boxes suitable for small targets. Secondly, the YOLOv7-BFAW model is proposed. This method incorporates a series of improvements to YOLOv7, including the integration of a BBF residual structure based on BiFormer and BottleNeck fusion into the backbone network, the design of an MPsim module based on simAM attention for the head network, and the development of a novel loss function, inner-WIoU v2, as the localization loss function, based on WIoU v2. Experiments demonstrate that YOLOv7-BFAW achieves a 4.2% mAP@.5 improvement on the DOTA v1.0 dataset and a 1.7% mAP@.5 improvement on the VisDrone2019 dataset, showcasing excellent generalization capabilities. Furthermore, it is shown that YOLOv7-BFAW exhibits superior detection performance compared to state-of-the-art algorithms.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}