Advanced deepfake detection with enhanced Resnet-18 and multilayer CNN max pooling
Pub Date: 2024-09-18  DOI: 10.1007/s00371-024-03613-x
Muhammad Fahad, Tao Zhang, Yasir Iqbal, Azaz Ikram, Fazeela Siddiqui, Bin Younas Abdullah, Malik Muhammad Nauman, Xin Zhao, Yanzhang Geng
Artificial intelligence has revolutionized technology, with generative adversarial networks (GANs) generating fake samples and deepfake videos. These technologies can cause panic and instability by allowing anyone to produce propaganda, so a robust system for distinguishing authentic from counterfeit information is crucial in the current social media era. This study offers an automated approach for categorizing deepfake videos using advanced machine learning and deep learning techniques. The processed videos are classified with a deep learning-based enhanced ResNet-18 combined with convolutional neural network (CNN) multilayer max pooling. This research contributes precise detection techniques for deepfake technology, which is gradually becoming a serious problem for digital media. The proposed enhanced ResNet-18 CNN method applies deep learning to GAN-based and other AI-generated videos to distinguish genuine from fake content. We fuse the sub-datasets (FaceSwap, Face2Face, Deepfakes, NeuralTextures) of FaceForensics, CelebDF, DeeperForensics, DeepFake Detection, and our own private dataset into one combined dataset containing 11,404 videos. The training data cover a diverse range of videos and sentiments, demonstrating the model's capability. The model is designed to identify videos with swapped faces as fake and videos without swaps as real. This work advances digital forensics by providing an effective response to deepfakes. The proposed model outperformed conventional methods in predicting video frames, with an accuracy of 99.99%, F-score of 99.98%, recall of 100%, and precision of 99.99%, confirming its effectiveness through comparative analysis. The source code is publicly available at https://doi.org/10.5281/zenodo.12538330.
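The abstract does not spell out how the ResNet-18 backbone and the multilayer max-pooling head fit together, so the following is a minimal sketch of one plausible layout, not the authors' exact architecture: a torchvision ResNet-18 feature extractor followed by stacked convolution + max-pooling stages and a binary real/fake classifier. All layer sizes and the class name are illustrative assumptions.

```python
# Sketch: ResNet-18 backbone + multilayer max-pooling head for real/fake frame classification.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class DeepfakeClassifier(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to (but not including) the average-pool / fc head.
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # -> (B, 512, H/32, W/32)
        # "Multilayer max pooling" head: stacked conv + max-pool stages (assumed sizes).
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = self.head(x).flatten(1)
        return self.classifier(x)


# Per-frame prediction on a dummy batch of face crops.
model = DeepfakeClassifier()
logits = model(torch.randn(4, 3, 224, 224))  # (4, 2) real/fake logits
```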
{"title":"Advanced deepfake detection with enhanced Resnet-18 and multilayer CNN max pooling","authors":"Muhammad Fahad, Tao Zhang, Yasir Iqbal, Azaz Ikram, Fazeela Siddiqui, Bin Younas Abdullah, Malik Muhammad Nauman, Xin Zhao, Yanzhang Geng","doi":"10.1007/s00371-024-03613-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03613-x","url":null,"abstract":"<p>Artificial intelligence has revolutionized technology, with generative adversarial networks (GANs) generating fake samples and deepfake videos. These technologies can lead to panic and instability, allowing anyone to produce propaganda. Therefore, it is crucial to develop a robust system to distinguish between authentic and counterfeit information in the current social media era. This study offers an automated approach for categorizing deepfake videos using advanced machine learning and deep learning techniques. The processed videos are classified using a deep learning-based enhanced Resnet-18 with convolutional neural network (CNN) multilayer max pooling. This research contributes to studying precise detection techniques for deepfake technology, which is gradually becoming a serious problem for digital media. The proposed enhanced Resnet-18 CNN method integrates deep learning algorithms on GAN architecture and artificial intelligence-generated videos to analyze and determine genuine and fake videos. In this research, we fuse the sub-datasets (faceswap, face2face, deepfakes, neural textures) of FaceForensics, CelebDF, DeeperForensics, DeepFake detection and our own created private dataset into one combined dataset, and the total number of videos are (11,404) in this fused dataset. The dataset on which it was trained has a diverse range of videos and sentiments, demonstrating its capability. The structure of the model is designed to predict and identify videos with faces accurately switched as fakes, while those without switches are real. This paper is a great leap forward in the area of digital forensics, providing an excellent response to deepfakes. The proposed model outperformed conventional methods in predicting video frames, with an accuracy score of 99.99%, F-score of 99.98%, recall of 100%, and precision of 99.99%, confirming its effectiveness through a comparative analysis. The source code of this study is available publically at https://doi.org/10.5281/zenodo.12538330.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video-driven musical composition using large language model with memory-augmented state space
Pub Date: 2024-09-18  DOI: 10.1007/s00371-024-03606-w
Wan-He Kai, Kai-Xin Xing
The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, and videos. However, research on LLMs for music inspiration is still in its infancy. To fill this gap and break through the limitation that LLMs can only understand short videos with limited frames, we propose a large language model with state space for long-term video-to-music generation. To capture long-range dependencies and maintain high performance while further decreasing the computing cost, our overall network includes the Enhanced Video Mamba, which incorporates continuous moving window partitioning and local feature augmentation, and a long-term memory bank that captures and aggregates historical video information to mitigate information loss in long sequences. This framework achieves both subquadratic-time computation and near-linear memory complexity, enabling effective long-term video-to-music generation. We conduct a thorough evaluation of the proposed framework; the experimental results demonstrate that our model matches or surpasses the performance of current state-of-the-art models. Our code is released at https://github.com/kai211233/S2L2-V2M.
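To make the two mechanisms named here concrete, the sketch below shows moving-window partitioning of a long frame-feature sequence plus a bounded memory bank that aggregates per-window summaries. It is not the authors' implementation: the state-space (Mamba) block is stubbed with a GRU purely so the sketch runs, and window, stride, and memory sizes are assumptions.

```python
# Sketch: overlapping moving windows + long-term memory bank over per-frame features.
import torch
import torch.nn as nn


class WindowedEncoderWithMemory(nn.Module):  # hypothetical name
    def __init__(self, dim=256, window=16, stride=8, memory_slots=32):
        super().__init__()
        self.window, self.stride = window, stride
        self.block = nn.GRU(dim, dim, batch_first=True)  # stand-in for the state-space block
        self.memory = []                                  # list of (B, dim) window summaries
        self.memory_slots = memory_slots

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        """feats: (B, T, dim) per-frame features of a long video."""
        outputs = []
        T = feats.shape[1]
        for start in range(0, max(T - self.window, 0) + 1, self.stride):
            win = feats[:, start:start + self.window]          # overlapping moving window
            enc, _ = self.block(win)                            # (B, window, dim)
            summary = enc.mean(dim=1)                           # one token per window
            self.memory.append(summary.detach())
            self.memory = self.memory[-self.memory_slots:]      # bounded memory bank
            history = torch.stack(self.memory, dim=1).mean(1)   # aggregate historical info
            outputs.append(enc + history.unsqueeze(1))          # inject long-term context
        return torch.cat(outputs, dim=1)


enc = WindowedEncoderWithMemory()
out = enc(torch.randn(2, 128, 256))  # 128-frame sequence processed window by window
```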
{"title":"Video-driven musical composition using large language model with memory-augmented state space","authors":"Wan-He Kai, Kai-Xin Xing","doi":"10.1007/s00371-024-03606-w","DOIUrl":"https://doi.org/10.1007/s00371-024-03606-w","url":null,"abstract":"<p>The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. However, the research work on LLms for music inspiration is still in its infancy. To fill the gap in this field and break through the dilemma that LLMs can only understand short videos with limited frames, we propose a large language model with state space for long-term video-to-music generation. To capture long-range dependency and maintaining high performance, while further decrease the computing cost, our overall network includes the Enhanced Video Mamba, which incorporates continuous moving window partitioning and local feature augmentation, and a long-term memory bank that captures and aggregates historical video information to mitigate information loss in long sequences. This framework achieves both subquadratic-time computation and near-linear memory complexity, enabling effective long-term video-to-music generation. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models. Our code released on https://github.com/kai211233/S2L2-V2M.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D human pose estimation using spatiotemporal hypergraphs and its public benchmark on opera videos
Pub Date: 2024-09-17  DOI: 10.1007/s00371-024-03604-y
Xingquan Cai, Haoyu Zhang, LiZhe Chen, YiJie Wu, Haiyan Sun
Graph convolutional networks significantly improve 3D human pose estimation accuracy by representing the human skeleton as an undirected spatiotemporal graph. However, this representation fails to reflect the cross-connection interactions of multiple joints, and current 3D human pose estimation methods have larger errors on opera videos because of the occlusion caused by clothing and movements. In this paper, we propose a 3D human pose estimation method based on spatiotemporal hypergraphs for opera videos. First, the 2D human pose sequence of the opera performer is taken as input, and, based on the interaction information between multiple joints in the opera action, multiple spatiotemporal hypergraphs representing the spatial correlation and temporal continuity of the joints are generated. Then, a hypergraph convolution network is constructed from these joint spatiotemporal hypergraphs to extract spatiotemporal features from the 2D pose sequence. Finally, a multi-hypergraph cross-attention mechanism is introduced to strengthen the correlation between spatiotemporal hypergraphs and predict 3D human poses. Experiments show that our method achieves the best performance on the Human3.6M and MPI-INF-3DHP datasets compared to graph convolutional network and Transformer-based methods. In addition, ablation experiments show that the generated spatiotemporal hypergraphs effectively improve accuracy compared to the undirected spatiotemporal graph. The experiments demonstrate that the method obtains accurate 3D human poses in the presence of clothing and limb occlusion in opera videos. Code will be available at https://github.com/zhanghaoyu0408/hyperAzzy.
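For readers unfamiliar with hypergraph convolution, the sketch below shows one layer operating on 2D joint features, following the standard HGNN propagation rule X' = Dv^(-1/2) H De^(-1) H^T Dv^(-1/2) X Theta (hyperedge weights taken as identity). The incidence matrix, class name, and feature sizes are toy assumptions, not the paper's construction.

```python
# Sketch: one hypergraph convolution layer over joint features, hyperedges group several joints.
import torch
import torch.nn as nn


class HypergraphConv(nn.Module):  # assumed layer, not the paper's code
    def __init__(self, in_dim, out_dim, incidence: torch.Tensor):
        super().__init__()
        H = incidence.float()                        # (num_joints, num_hyperedges)
        Dv = H.sum(1).clamp(min=1).pow(-0.5).diag()  # vertex degree ^ -1/2
        De = H.sum(0).clamp(min=1).pow(-1.0).diag()  # hyperedge degree ^ -1
        self.register_buffer("prop", Dv @ H @ De @ H.t() @ Dv)
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, num_joints, in_dim) 2D pose features."""
        return torch.relu(self.prop @ self.theta(x))


# Toy incidence: 5 joints, 2 hyperedges (joints 0-2 form one group, 2-4 the other).
H = torch.tensor([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1]])
layer = HypergraphConv(2, 64, H)
feats = layer(torch.randn(8, 5, 2))  # (8, 5, 64)
```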
{"title":"3D human pose estimation using spatiotemporal hypergraphs and its public benchmark on opera videos","authors":"Xingquan Cai, Haoyu Zhang, LiZhe Chen, YiJie Wu, Haiyan Sun","doi":"10.1007/s00371-024-03604-y","DOIUrl":"https://doi.org/10.1007/s00371-024-03604-y","url":null,"abstract":"<p>Graph convolutional networks significantly improve the 3D human pose estimation accuracy by representing the human skeleton as an undirected spatiotemporal graph. However, this representation fails to reflect the cross-connection interactions of multiple joints, and the current 3D human pose estimation methods have larger errors in opera videos due to the occlusion of clothing and movements in opera videos. In this paper, we propose a 3D human pose estimation method based on spatiotemporal hypergraphs for opera videos. <i>First, the 2D human pose sequence of the opera video performer is inputted, and based on the interaction information between multiple joints in the opera action, multiple spatiotemporal hypergraphs representing the spatial correlation and temporal continuity of the joints are generated. Then, a hypergraph convolution network is constructed using the joints spatiotemporal hypergraphs to extract the spatiotemporal features in the 2D human poses sequence. Finally, a multi-hypergraph cross-attention mechanism is introduced to strengthen the correlation between spatiotemporal hypergraphs and predict 3D human poses</i>. Experiments show that our method achieves the best performance on the Human3.6M and MPI-INF-3DHP datasets compared to the graph convolutional network and Transformer-based methods. In addition, ablation experiments show that the multiple spatiotemporal hypergraphs we generate can effectively improve the network accuracy compared to the undirected spatiotemporal graph. The experiments demonstrate that the method can obtain accurate 3D human poses in the presence of clothing and limb occlusion in opera videos. Codes will be available at: https://github.com/zhanghaoyu0408/hyperAzzy.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lunet: an enhanced upsampling fusion network with efficient self-attention for semantic segmentation
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03590-1
Yan Zhou, Haibin Zhou, Yin Yang, Jianxun Li, Richard Irampaye, Dongli Wang, Zhengpeng Zhang
Semantic segmentation is an essential aspect of many computer vision tasks. Self-attention (SA)-based deep learning methods have shown impressive results in semantic segmentation by capturing long-range dependencies and contextual information. However, the standard SA module has high computational complexity, which limits its use in resource-constrained scenarios. This paper proposes a novel LUNet to improve semantic segmentation performance while addressing the computational challenges of SA. The lightweight self-attention plus (LSA++) module is introduced as a lightweight and efficient variant of the SA module. LSA++ uses compact feature representation and local position embedding to significantly reduce computational complexity while surpassing the accuracy of the standard SA module. Furthermore, to address the loss of edge details during decoding, we propose the enhanced upsampling fusion module (EUP-FM). This module comprises an enhanced upsampling module and a semantic vector-guided fusion mechanism. EUP-FM effectively recovers edge information and improves the precision of the segmentation map. Comprehensive experiments on PASCAL VOC 2012, Cityscapes, COCO, and SegPC 2021 demonstrate that LUNet outperforms all compared methods. It achieves superior runtime performance and accurate segmentation with excellent model generalization ability. The code is available at https://github.com/hbzhou530/LUNet.
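The abstract describes LSA++ only at a high level, so the following is a minimal sketch, under assumed design choices, of a generic lightweight self-attention block: keys and values are spatially reduced to cut the quadratic cost, and a depthwise convolution provides a local position embedding. It illustrates the general idea behind such modules rather than the exact LSA++ implementation; all sizes and names are assumptions.

```python
# Sketch: self-attention with reduced K/V tokens and a depthwise-conv position embedding.
import torch
import torch.nn as nn


class LightSelfAttention(nn.Module):  # hypothetical module
    def __init__(self, dim=64, heads=4, reduction=4):
        super().__init__()
        self.pos = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)        # local position embedding
        self.reduce = nn.Conv2d(dim, dim, reduction, stride=reduction)  # compact K/V representation
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, H, W) feature map."""
        x = x + self.pos(x)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) queries
        kv = self.reduce(x).flatten(2).transpose(1, 2)   # (B, H*W / r^2, C) compact keys/values
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W) + x


blk = LightSelfAttention()
y = blk(torch.randn(2, 64, 32, 32))  # same output shape, ~16x fewer K/V tokens
```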
{"title":"Lunet: an enhanced upsampling fusion network with efficient self-attention for semantic segmentation","authors":"Yan Zhou, Haibin Zhou, Yin Yang, Jianxun Li, Richard Irampaye, Dongli Wang, Zhengpeng Zhang","doi":"10.1007/s00371-024-03590-1","DOIUrl":"https://doi.org/10.1007/s00371-024-03590-1","url":null,"abstract":"<p>Semantic segmentation is an essential aspect of many computer vision tasks. Self-attention (SA)-based deep learning methods have shown impressive results in semantic segmentation by capturing long-range dependencies and contextual information. However, the standard SA module has high computational complexity, which limits its use in resource-constrained scenarios. This paper proposes a novel LUNet to improve semantic segmentation performance while addressing the computational challenges of SA. The lightweight self-attention plus (LSA++) module is introduced as a lightweight and efficient variant of the SA module. LSA++ uses compact feature representation and local position embedding to significantly reduce computational complexity while surpassing the accuracy of the standard SA module. Furthermore, to address the loss of edge details during decoding, we propose the enhanced upsampling fusion module (EUP-FM). This module comprises an enhanced upsampling module and a semantic vector-guided fusion mechanism. EUP-FM effectively recovers edge information and improves the precision of the segmentation map. Comprehensive experiments on PASCAL VOC 2012, Cityscapes, COCO, and SegPC 2021 demonstrate that LUNet outperforms all compared methods. It achieves superior runtime performance and accurate segmentation with excellent model generalization ability. The code is available at https://github.com/hbzhou530/LUNet.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FDDCC-VSR: a lightweight video super-resolution network based on deformable 3D convolution and cheap convolution
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03621-x
Xiaohu Wang, Xin Yang, Hengrui Li, Tao Li
Currently, mainstream deep video super-resolution (VSR) models typically employ deeper neural network layers or larger receptive fields. This approach increases computational requirements, making network training difficult and inefficient. Therefore, this paper proposes a VSR model called fusion of deformable 3D convolution and cheap convolution (FDDCC-VSR). In FDDCC-VSR, we first divide the detailed features of each frame into dynamic features of visually moving objects and details of static backgrounds. This division allows fewer specialized convolutions to be used in feature extraction, resulting in a lightweight network that is easier to train. Furthermore, FDDCC-VSR incorporates multiple D-C CRBs (convolutional residual blocks), which establish a lightweight spatial attention mechanism to aid the deformable 3D convolution, enabling the model to focus on learning the corresponding feature details. Finally, we employ improved bicubic interpolation combined with subpixel techniques to enhance the PSNR (peak signal-to-noise ratio) of the reconstructed frames. Detailed experiments demonstrate that FDDCC-VSR outperforms state-of-the-art algorithms in terms of both subjective visual quality and objective evaluation criteria, while incurring only a small parameter and computation overhead.
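As a point of reference for the "cheap convolution" half of the design, the sketch below shows a GhostNet-style block: a standard convolution produces half of the output channels and inexpensive depthwise convolutions synthesize the rest, roughly halving the FLOPs. This is an assumed illustration of the general technique, not the FDDCC-VSR block itself.

```python
# Sketch: cheap ("ghost"-style) convolution block.
import torch
import torch.nn as nn


class CheapConv(nn.Module):  # hypothetical block
    def __init__(self, in_ch, out_ch):
        super().__init__()
        primary = out_ch // 2
        self.primary = nn.Sequential(          # standard conv produces half the channels
            nn.Conv2d(in_ch, primary, 1, bias=False),
            nn.BatchNorm2d(primary), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(             # depthwise convs synthesize the rest
            nn.Conv2d(primary, out_ch - primary, 3, padding=1, groups=primary, bias=False),
            nn.BatchNorm2d(out_ch - primary), nn.ReLU(inplace=True))

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)


x = torch.randn(1, 32, 64, 64)
print(CheapConv(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```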
{"title":"FDDCC-VSR: a lightweight video super-resolution network based on deformable 3D convolution and cheap convolution","authors":"Xiaohu Wang, Xin Yang, Hengrui Li, Tao Li","doi":"10.1007/s00371-024-03621-x","DOIUrl":"https://doi.org/10.1007/s00371-024-03621-x","url":null,"abstract":"<p>Currently, the mainstream deep video super-resolution (VSR) models typically employ deeper neural network layers or larger receptive fields. This approach increases computational requirements, making network training difficult and inefficient. Therefore, this paper proposes a VSR model called fusion of deformable 3D convolution and cheap convolution (FDDCC-VSR).In FDDCC-VSR, we first divide the detailed features of each frame in VSR into dynamic features of visual moving objects and details of static backgrounds. This division allows for the use of fewer specialized convolutions in feature extraction, resulting in a lightweight network that is easier to train. Furthermore, FDDCC-VSR incorporates multiple D-C CRBs (Convolutional Residual Blocks), which establish a lightweight spatial attention mechanism to aid deformable 3D convolution. This enables the model to focus on learning the corresponding feature details. Finally, we employ an improved bicubic interpolation combined with subpixel techniques to enhance the PSNR (Peak Signal-to-Noise Ratio) value of the original image. Detailed experiments demonstrate that FDDCC-VSR outperforms the most advanced algorithms in terms of both subjective visual effects and objective evaluation criteria. Additionally, our model exhibits a small parameter and calculation overhead.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topological structure extraction for computing surface–surface intersection curves
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03616-8
Pengbo Bo, Qingxiang Liu, Caiming Zhang
Surface–surface intersection curve computation is a fundamental problem in CAD and solid modeling. Extracting the structure of intersection curves accurately, especially when there are multiple overlapping curves, is a key challenge. Existing methods rely on densely sampled intersection points and proximity-based connections, which are time-consuming to obtain. In this paper, we propose a novel method based on Delaunay triangulation to accurately extract intersection curves, even with sparse intersection points. We also introduce an intersection curve optimization technique to enhance curve accuracy. Extensive experiments on various examples demonstrate the effectiveness of our method.
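The sketch below illustrates the basic idea of recovering curve structure from sparse samples with a Delaunay triangulation: triangulate the intersection points and keep only short edges, so chains of kept edges trace the curve branches. It is not the authors' algorithm; the median-based pruning threshold and the 2D parameter-domain setting are assumptions.

```python
# Sketch: connect sparse intersection points via Delaunay triangulation and edge pruning.
import numpy as np
from scipy.spatial import Delaunay


def curve_edges(points: np.ndarray, factor: float = 2.0):
    """points: (N, 2) intersection samples (e.g. in one surface's parameter domain)."""
    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:               # each simplex is a triangle (i, j, k)
        for a, b in [(0, 1), (1, 2), (0, 2)]:
            i, j = sorted((simplex[a], simplex[b]))
            edges.add((i, j))
    edges = sorted(edges)
    lengths = np.array([np.linalg.norm(points[i] - points[j]) for i, j in edges])
    keep = lengths < factor * np.median(lengths)    # drop long edges bridging separate branches
    return [e for e, k in zip(edges, keep) if k]


# Two sparsely sampled curve branches.
t = np.linspace(0, 1, 25)
pts = np.vstack([np.c_[t, np.sin(3 * t)], np.c_[t, 1.5 + 0.3 * t]])
print(len(curve_edges(pts)), "edges kept")
```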
{"title":"Topological structure extraction for computing surface–surface intersection curves","authors":"Pengbo Bo, Qingxiang Liu, Caiming Zhang","doi":"10.1007/s00371-024-03616-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03616-8","url":null,"abstract":"<p>Surface–surface intersection curve computation is a fundamental problem in CAD and solid modeling. Extracting the structure of intersection curves accurately, especially when there are multiple overlapping curves, is a key challenge. Existing methods rely on densely sampled intersection points and proximity-based connections, which are time-consuming to obtain. In this paper, we propose a novel method based on Delaunay triangulation to accurately extract intersection curves, even with sparse intersection points. We also introduce an intersection curve optimization technique to enhance curve accuracy. Extensive experiments on various examples demonstrate the effectiveness of our method.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"77 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03611-z
Sunhan Xu, Jinhua Wang, Ning He, Guangmei Xu, Geng Zhang
Underwater image enhancement is critical for advancing marine science and underwater engineering. Traditional methods often struggle with color distortion, low contrast, and blurred details due to the challenging underwater environment. Addressing these issues, we introduce a semi-supervised underwater image enhancement framework, Semi-UIE, which leverages unlabeled data alongside limited labeled data to significantly enhance generalization capabilities. This framework integrates a novel aggregated attention within a UNet architecture, utilizing multi-scale convolutional kernels for efficient feature aggregation. This approach not only improves the sharpness and authenticity of underwater visuals but also ensures substantial computational efficiency. Importantly, Semi-UIE excels in capturing both macro- and micro-level details, effectively addressing common issues of over-correction and detail loss. Our experimental results demonstrate a marked improvement in performance on several public datasets, including UIEBD and EUVP, with notable enhancements in image quality metrics compared to existing methods. The robustness of our model across diverse underwater environments is confirmed by its superior performance on unlabeled datasets. Our code and pre-trained models are available at https://github.com/Sunhan-Ash/Semi-UIE.
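The "aggregated attention with multi-scale convolutional kernels" is described only at a high level, so the following is a minimal sketch of one plausible reading: parallel depthwise convolutions with different kernel sizes are summed and re-weighted by a channel gate. The layout, channel counts, and class name are assumptions, not the Semi-UIE module.

```python
# Sketch: multi-scale kernel aggregation followed by a channel-attention gate.
import torch
import torch.nn as nn


class MultiScaleAggAttention(nn.Module):  # hypothetical module
    def __init__(self, ch=64, kernels=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch) for k in kernels)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())

    def forward(self, x):
        fused = sum(branch(x) for branch in self.branches)   # aggregate the scales
        return x + fused * self.gate(fused)                  # channel-attended residual


blk = MultiScaleAggAttention()
y = blk(torch.randn(1, 64, 128, 128))  # same spatial size, enhanced features
```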
{"title":"Optimizing underwater image enhancement: integrating semi-supervised learning and multi-scale aggregated attention","authors":"Sunhan Xu, Jinhua Wang, Ning He, Guangmei Xu, Geng Zhang","doi":"10.1007/s00371-024-03611-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03611-z","url":null,"abstract":"<p>Underwater image enhancement is critical for advancing marine science and underwater engineering. Traditional methods often struggle with color distortion, low contrast, and blurred details due to the challenging underwater environment. Addressing these issues, we introduce a semi-supervised underwater image enhancement framework, Semi-UIE, which leverages unlabeled data alongside limited labeled data to significantly enhance generalization capabilities. This framework integrates a novel aggregated attention within a UNet architecture, utilizing multi-scale convolutional kernels for efficient feature aggregation. This approach not only improves the sharpness and authenticity of underwater visuals but also ensures substantial computational efficiency. Importantly, Semi-UIE excels in capturing both macro- and micro-level details, effectively addressing common issues of over-correction and detail loss. Our experimental results demonstrate a marked improvement in performance on several public datasets, including UIEBD and EUVP, with notable enhancements in image quality metrics compared to existing methods. The robustness of our model across diverse underwater environments is confirmed by its superior performance on unlabeled datasets. Our code and pre-trained models are available at https://github.com/Sunhan-Ash/Semi-UIE.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FFCANet: a frequency channel fusion coordinate attention mechanism network for lane detection
Pub Date: 2024-09-16  DOI: 10.1007/s00371-024-03626-6
Shijie Li, Shanhua Yao, Zhonggen Wang, Juan Wu
Lane line detection is a challenging task in complex and dynamic driving scenarios. To address the limitations of existing lane line detection algorithms, which struggle to balance accuracy and efficiency in complex and changing traffic scenes, a frequency channel fusion coordinate attention mechanism network (FFCANet) for lane detection is proposed. A residual neural network (ResNet) is used as the feature extraction backbone. We propose a feature enhancement method with a frequency channel fusion coordinate attention mechanism (FFCA) that captures feature information from different spatial orientations and then uses multiple frequency components to extract detail and texture features of lane lines. A row-anchor-based prediction and classification method treats lane line detection as the problem of selecting lane marking anchors within row-oriented cells predefined by global features, which greatly improves detection speed and can handle scenarios with no visual clues. Additionally, an efficient channel attention (ECA) module is integrated into the auxiliary segmentation branch to capture dynamic dependencies between channels, further enhancing feature extraction. The performance of the model is evaluated on two publicly available datasets, TuSimple and CULane. Results show an average processing time of 5.0 ms per frame, an accuracy of 96.09% on TuSimple, and an F1 score of 72.8% on CULane. The model is robust in complex scenes while effectively balancing detection accuracy and speed. The source code is available at https://github.com/lsj1012/FFCANet/tree/master
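To make the row-anchor formulation concrete, the sketch below shows a head that, for each predefined row and each lane, classifies which of a fixed number of column cells contains the lane marking (an extra class means "no lane in this row"). The feature dimension, row/cell counts, and class name are assumed values for illustration, not FFCANet's exact head.

```python
# Sketch: row-anchor classification head over pooled global features.
import torch
import torch.nn as nn


class RowAnchorHead(nn.Module):  # hypothetical head
    def __init__(self, feat_dim=512, rows=56, grid_cols=100, lanes=4):
        super().__init__()
        self.rows, self.cols, self.lanes = rows, grid_cols + 1, lanes  # +1 = "no lane" class
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 2048), nn.ReLU(inplace=True),
            nn.Linear(2048, rows * self.cols * lanes))

    def forward(self, global_feat: torch.Tensor) -> torch.Tensor:
        """global_feat: (B, feat_dim) pooled backbone features."""
        out = self.fc(global_feat)
        return out.view(-1, self.cols, self.rows, self.lanes)  # logits per (cell, row, lane)


head = RowAnchorHead()
logits = head(torch.randn(2, 512))
cells = logits.argmax(dim=1)  # (2, 56, 4): chosen column cell per row and lane
```

Because the head only selects one cell per row and lane from a global feature vector, the per-frame cost is a couple of fully connected layers rather than dense per-pixel segmentation, which is what makes this formulation fast.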
{"title":"FFCANet: a frequency channel fusion coordinate attention mechanism network for lane detection","authors":"Shijie Li, Shanhua Yao, Zhonggen Wang, Juan Wu","doi":"10.1007/s00371-024-03626-6","DOIUrl":"https://doi.org/10.1007/s00371-024-03626-6","url":null,"abstract":"<p>Lane line detection becomes a challenging task in complex and dynamic driving scenarios. Addressing the limitations of existing lane line detection algorithms, which struggle to balance accuracy and efficiency in complex and changing traffic scenarios, a frequency channel fusion coordinate attention mechanism network (FFCANet) for lane detection is proposed. A residual neural network (ResNet) is used as a feature extraction backbone network. We propose a feature enhancement method with a frequency channel fusion coordinate attention mechanism (FFCA) that captures feature information from different spatial orientations and then uses multiple frequency components to extract detail and texture features of lane lines. A row-anchor-based prediction and classification method treats lane line detection as a problem of selecting lane marking anchors within row-oriented cells predefined by global features, which greatly improves the detection speed and can handle visionless driving scenarios. Additionally, an efficient channel attention (ECA) module is integrated into the auxiliary segmentation branch to capture dynamic dependencies between channels, further enhancing feature extraction capabilities. The performance of the model is evaluated on two publicly available datasets, TuSimple and CULane. Simulation results demonstrate that the average processing time per image frame is 5.0 ms, with an accuracy of 96.09% on the TuSimple dataset and an F1 score of 72.8% on the CULane dataset. The model exhibits excellent robustness in detecting complex scenes while effectively balancing detection accuracy and speed. The source code is available at https://github.com/lsj1012/FFCANet/tree/master</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text-guided floral image generation based on lightweight deep attention feature fusion GAN
Pub Date: 2024-09-14  DOI: 10.1007/s00371-024-03617-7
Wenji Yang, Hang An, Wenchao Hu, Xinxin Ma, Liping Xie
Generating floral images conditioned on textual descriptions is a highly challenging task. Most existing text-to-floral image synthesis methods adopt a single-stage generation architecture, which often requires substantial hardware resources, such as large-scale GPU clusters and a large number of training images. Moreover, this architecture tends to lose detail when shallow image features are fused with deep image features. To address these challenges, this paper proposes a Lightweight Deep Attention Feature Fusion Generative Adversarial Network for the text-to-floral image generation task, which performs well even with limited hardware resources. First, we introduce a novel Deep Attention Text-Image Fusion Block that leverages Multi-scale Channel Attention Mechanisms to effectively enhance detail rendering and visual consistency in text-generated floral images. Second, we propose a novel Self-Supervised Target-Aware Discriminator capable of learning a richer feature mapping coverage area from input images, which not only helps the generator create higher-quality images but also improves the training efficiency of GANs, further reducing resource consumption. Finally, extensive experiments on datasets of three different sample sizes validate the effectiveness of the proposed model. Source code and pretrained models are available at https://github.com/BoomAnm/LDAF-GAN.
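The abstract does not detail how the text embedding and image features are fused, so the following is a minimal sketch of one common pattern: the sentence embedding is projected to per-channel scale and shift parameters that modulate the generator features, followed by a simple channel gate (a single-scale stand-in for the paper's multi-scale channel attention). All names, dimensions, and the overall layout are assumptions, not the paper's block.

```python
# Sketch: text-conditioned affine modulation of image features plus a channel gate.
import torch
import torch.nn as nn


class TextImageFusion(nn.Module):  # hypothetical block
    def __init__(self, ch=128, text_dim=256):
        super().__init__()
        self.affine = nn.Linear(text_dim, ch * 2)          # per-channel gamma, beta from text
        self.local = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, img: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        """img: (B, ch, H, W) generator features; text: (B, text_dim) sentence embedding."""
        gamma, beta = self.affine(text).chunk(2, dim=1)
        x = img * (1 + gamma[..., None, None]) + beta[..., None, None]
        x = x + self.local(x)                              # local spatial context
        return x * self.gate(x)                            # channel re-weighting


fuse = TextImageFusion()
out = fuse(torch.randn(2, 128, 16, 16), torch.randn(2, 256))  # (2, 128, 16, 16)
```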
{"title":"Text-guided floral image generation based on lightweight deep attention feature fusion GAN","authors":"Wenji Yang, Hang An, Wenchao Hu, Xinxin Ma, Liping Xie","doi":"10.1007/s00371-024-03617-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03617-7","url":null,"abstract":"<p>Generating floral images conditioned on textual descriptions is a highly challenging task. However, most existing text-to-floral image synthesis methods adopt a single-stage generation architecture, which often requires substantial hardware resources, such as large-scale GPU clusters and a large number of training images. Moreover, this architecture tends to lose some detail features when shallow image features are fused with deep image features. To address these challenges, this paper proposes a Lightweight Deep Attention Feature Fusion Generative Adversarial Network for the text-to-floral image generation task. This network performs impressively well even with limited hardware resources. Specifically, we introduce a novel Deep Attention Text-Image Fusion Block that leverages Multi-scale Channel Attention Mechanisms to effectively enhance the capability of displaying details and visual consistency in text-generated floral images. Secondly, we propose a novel Self-Supervised Target-Aware Discriminator capable of learning a richer feature mapping coverage area from input images. This not only aids the generator in creating higher-quality images but also improves the training efficiency of GANs, further reducing resource consumption. Finally, extensive experiments on dataset of three different sample sizes validate the effectiveness of the proposed model. Source code and pretrained models are available at https://github.com/BoomAnm/LDAF-GAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Research on a small target object detection method for aerial photography based on improved YOLOv7
Pub Date: 2024-09-14  DOI: 10.1007/s00371-024-03615-9
Jiajun Yang, Xuesong Zhang, Cunli Song
In aerial imagery analysis, detecting small targets is highly challenging due to their minimal pixel representation and complex backgrounds. To address this issue, this manuscript proposes a novel method for detecting small aerial targets. First, the K-means++ algorithm is utilized to generate anchor boxes suitable for small targets. Second, the YOLOv7-BFAW model is proposed. This method incorporates a series of improvements to YOLOv7, including the integration of a BBF residual structure based on BiFormer and BottleNeck fusion into the backbone network, the design of an MPsim module based on simAM attention for the head network, and the development of a novel localization loss function, inner-WIoU v2, based on WIoU v2. Experiments demonstrate that YOLOv7-BFAW achieves a 4.2% mAP@.5 improvement on the DOTA v1.0 dataset and a 1.7% mAP@.5 improvement on the VisDrone2019 dataset, showcasing excellent generalization. Furthermore, YOLOv7-BFAW exhibits superior detection performance compared to state-of-the-art algorithms.
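The K-means++ anchor-generation step can be illustrated directly: cluster the ground-truth box widths and heights and use the cluster centers as anchors. The sketch below uses scikit-learn's K-means with k-means++ initialization and Euclidean distance; the random box data and anchor count are illustrative, and the paper may use a different distance or preprocessing.

```python
# Sketch: generate anchor boxes by clustering ground-truth (width, height) pairs with K-means++.
import numpy as np
from sklearn.cluster import KMeans


def anchor_boxes(wh: np.ndarray, n_anchors: int = 9) -> np.ndarray:
    """wh: (N, 2) ground-truth box (width, height) pairs, e.g. normalized to [0, 1]."""
    km = KMeans(n_clusters=n_anchors, init="k-means++", n_init=10, random_state=0).fit(wh)
    anchors = km.cluster_centers_
    return anchors[np.argsort(anchors.prod(axis=1))]   # sort by area, small to large


wh = np.abs(np.random.randn(500, 2)) * 0.1 + 0.05      # synthetic small-object boxes
print(anchor_boxes(wh).round(3))
```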
{"title":"Research on a small target object detection method for aerial photography based on improved YOLOv7","authors":"Jiajun Yang, Xuesong Zhang, Cunli Song","doi":"10.1007/s00371-024-03615-9","DOIUrl":"https://doi.org/10.1007/s00371-024-03615-9","url":null,"abstract":"<p>In aerial imagery analysis, detecting small targets is highly challenging due to their minimal pixel representation and complex backgrounds. To address this issue, this manuscript proposes a novel method for detecting small aerial targets. Firstly, the K-means + + algorithm is utilized to generate anchor boxes suitable for small targets. Secondly, the YOLOv7-BFAW model is proposed. This method incorporates a series of improvements to YOLOv7, including the integration of a BBF residual structure based on BiFormer and BottleNeck fusion into the backbone network, the design of an MPsim module based on simAM attention for the head network, and the development of a novel loss function, inner-WIoU v2, as the localization loss function, based on WIoU v2. Experiments demonstrate that YOLOv7-BFAW achieves a 4.2% mAP@.5 improvement on the DOTA v1.0 dataset and a 1.7% mAP@.5 improvement on the VisDrone2019 dataset, showcasing excellent generalization capabilities. Furthermore, it is shown that YOLOv7-BFAW exhibits superior detection performance compared to state-of-the-art algorithms.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}