The Visual Computer: latest articles (page 7)

MFDNet: Multi-Frequency Deflare Network for efficient nighttime flare removal

The Visual Computer

Pub Date : 2024-07-04 DOI: 10.1007/s00371-024-03540-x

Yiguo Jiang, Xuhang Chen, Chi-Man Pun, Shuqiang Wang, Wei Feng

When light is accidentally scattered or reflected inside the lens, flare artifacts may appear in captured photographs and degrade their visual quality. The main challenge in flare removal is to eliminate diverse flare artifacts while preserving the original content of the image. To address this challenge, we propose a lightweight Multi-Frequency Deflare Network (MFDNet) based on the Laplacian pyramid. Our network decomposes the flare-corrupted image into low- and high-frequency bands, effectively separating the illumination and content information in the image: the low-frequency part typically contains illumination information, while the high-frequency part contains detailed content information. Accordingly, MFDNet consists of two main modules: the Low-Frequency Flare Perception Module (LFFPM), which removes flare in the low-frequency part, and the Hierarchical Fusion Reconstruction Module (HFRM), which reconstructs the flare-free image. Specifically, to perceive flare from a global perspective while retaining detailed information for image restoration, LFFPM uses a Transformer to extract global information and a convolutional neural network to capture detailed local features. HFRM then gradually fuses the outputs of LFFPM with the high-frequency components of the image through feature aggregation. Moreover, MFDNet reduces the computational cost by processing multiple frequency bands instead of removing the flare directly on the input image. Experimental results demonstrate that our approach outperforms state-of-the-art methods in removing nighttime flare on real-world and synthetic images from the Flare7K dataset. Furthermore, the computational complexity of our model is remarkably low.
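
The frequency split described in this abstract can be illustrated with a small NumPy sketch of a Laplacian-pyramid-style decomposition. This is not the authors' code: the 2x2 average pooling (a box filter standing in for the usual Gaussian blur), the nearest-neighbour upsampling, and all function names are assumptions chosen to keep the example short and exactly invertible.

```python
import numpy as np

def downsample(img):
    """Halve the resolution with 2x2 average pooling (box filter, an assumption)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Double the resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    """Return (high-frequency bands, finest first) and the low-frequency residual."""
    highs = []
    current = img
    for _ in range(levels):
        low = downsample(current)
        highs.append(current - upsample(low))  # detail lost by downsampling
        current = low
    return highs, current

def reconstruct(highs, low):
    """Invert the decomposition by adding the bands back, coarsest first."""
    current = low
    for high in reversed(highs):
        current = upsample(current) + high
    return current

rng = np.random.default_rng(0)
image = rng.random((16, 16))          # stand-in for a single-channel image
highs, low = laplacian_pyramid(image, levels=2)
restored = reconstruct(highs, low)
```

In a pipeline of the kind the abstract outlines, a network would presumably operate on the small `low` band and the `highs` would be fused back in during reconstruction; processing the 4x4 residual instead of the 16x16 input is where the cost saving would come from.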

Citations: 0

Real-time salient object detection based on accuracy background and salient path source selection

The Visual Computer

Pub Date : 2024-07-03 DOI: 10.1007/s00371-024-03559-0

Wen-Kai Tsai, Hsin-Chih Wang

Boundary and connectivity priors are commonly used to detect salient objects in images. Methods based on them often face two problems: (1) if the salient object touches the image boundary, saliency detection for that object fails, and (2) accurate pixel-wise or superpixel-wise computation is time-consuming. This study proposes a block-wise algorithm that reduces computation time and handles salient objects that touch the image boundary. The algorithm consists of four stages. In the first stage, each block is analyzed by an adaptive micro and macro prediction technique to generate a saliency prediction map. The second stage selects background and salient sources from the saliency prediction map: background sources are extracted from image-boundary blocks with low saliency values, and salient sources are positioned accurately inside the region of the salient objects. In the third stage, the background and salient sources are used to generate a background path and a salient path based on the minimum barrier distance, and the block-wise initial saliency map is obtained by fusing the two paths. In the fourth stage, major-color modeling and visual focus priors are used to refine the saliency map and mitigate the block effect. In experiments on three datasets, the proposed method produced the best results among the compared algorithms and reached 284 frames per second (FPS) on the MSRA-10K dataset. Our method shows a speed improvement of at least 29.09% and executes in real time on a lightweight embedded platform.
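
The third stage relies on the minimum barrier distance, where the cost of a path is the difference between the largest and smallest value along it. The sketch below approximates that distance with a Dijkstra-style propagation on a small block grid. It is an illustration only: the propagation scheme is a common approximation rather than an exact solver, the toy grid and seed choices are made up, and the final fusion formula is a hypothetical placeholder, not the rule used in the paper.

```python
import heapq
import numpy as np

def barrier_distance(grid, seeds):
    """Approximate minimum barrier distance from a set of seed cells.

    A path's cost is max(values on path) - min(values on path). Costs are
    propagated Dijkstra-style, which approximates the true minimum.
    """
    h, w = grid.shape
    dist = np.full((h, w), np.inf)
    hi = grid.astype(float).copy()   # running maximum along the best path
    lo = grid.astype(float).copy()   # running minimum along the best path
    heap = []
    for r, c in seeds:
        dist[r, c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r, c]:
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w:
                new_hi = max(hi[r, c], grid[nr, nc])
                new_lo = min(lo[r, c], grid[nr, nc])
                nd = new_hi - new_lo
                if nd < dist[nr, nc]:
                    dist[nr, nc] = nd
                    hi[nr, nc] = new_hi
                    lo[nr, nc] = new_lo
                    heapq.heappush(heap, (nd, nr, nc))
    return dist

# Toy 5x5 block grid: a bright 3x3 object on a dark background.
grid = np.full((5, 5), 0.1)
grid[1:4, 1:4] = 0.9

border = [(r, c) for r in range(5) for c in range(5) if r in (0, 4) or c in (0, 4)]
bg_dist = barrier_distance(grid, border)     # large inside the object
sal_dist = barrier_distance(grid, [(2, 2)])  # small inside the object

# Hypothetical fusion (an assumption): far from background and close to the
# salient source means salient.
saliency = bg_dist / (bg_dist + sal_dist + 1e-9)
```

Working on blocks rather than pixels keeps the grid small, which is consistent with the speed argument made in the abstract.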

Citations: 0

Deep attentive multimodal learning for food information enhancement via early-stage heterogeneous fusion

The Visual Computer

Pub Date : 2024-07-03 DOI: 10.1007/s00371-024-03546-5

Avantika Saklani, Shailendra Tiwari, H. S. Pannu

In contrast to single-modal content, multimodal data can offer greater insight into food statistics, more vividly and effectively. Traditional food classification systems, however, focus on a single modality. This falls short as massive amounts of data emerge daily, a trend that has recently attracted researchers to the field. Moreover, very few multimodal Indian food datasets are available. Motivated by these observations, we build a novel multimodal food analysis model based on a deep attentive multimodal fusion network (DAMFN) that integrates linguistic and visual information. The model has three stages: functional feature extraction, early-stage fusion, and feature classification. In functional feature extraction, deep features are extracted from the individual modalities. An early-stage fusion that leverages the deep correlation between the modalities is then applied. Lastly, in the feature classification phase, the fused features are passed to the classification system for the final decision. For the experiments, we further developed a dataset of Indian food images with their related captions. In addition, the proposed approach is evaluated on a large-scale dataset, UPMC Food 101, which has 90,704 instances. The experimental results demonstrate that the proposed DAMFN outperforms several state-of-the-art multimodal food classification methods as well as the single-modality systems.
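
The three stages named in this abstract (feature extraction, early-stage fusion, classification) can be sketched generically as follows. Everything here is a placeholder: random vectors stand in for deep image and caption features, the per-dimension attention weights and the linear classifier are random rather than learned, and the class count is arbitrary. It shows only what "early" fusion means, namely joining the modalities before any classification layer, and does not reproduce DAMFN.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Stage 1 (stand-in): per-modality deep features.
image_feat = rng.standard_normal(8)   # hypothetical visual feature vector
text_feat = rng.standard_normal(6)    # hypothetical caption feature vector

# Stage 2: early fusion joins the modalities before classification.
fused = np.concatenate([image_feat, text_feat])

# Hypothetical attention: one score per fused dimension, softmax-normalised,
# used to re-weight the fused features (random here, learned in practice).
attn_scores = rng.standard_normal(fused.shape[0])
attended = fused * softmax(attn_scores)

# Stage 3: a linear classifier over three placeholder classes.
W = rng.standard_normal((3, fused.shape[0]))
probs = softmax(W @ attended)
```

By contrast, a late-fusion design would classify each modality separately and combine the per-modality decisions afterwards.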

Citations: 0

Transmission-guided multi-feature fusion Dehaze network

The Visual Computer

Pub Date : 2024-07-03 DOI: 10.1007/s00371-024-03533-w

Xiaoyang Zhao, Zhuo Wang, Zhongchao Deng, Hongde Qin, Zhongben Zhu

Image dehazing is an important low-level vision task, and its quality and efficiency directly affect the quality of high-level vision tasks. How to process hazy images with different fog thicknesses quickly and efficiently has therefore become a research focus. This paper presents a multi-feature fusion embedded image dehazing network based on transmission guidance. First, we propose a transmission graph-guided feature fusion enhanced encoding network, which can combine different weight information and is more flexible across different dehazing information. In addition, to keep more detailed information in the reconstructed image, we propose a decoder network embedded with a Mix module, which not only keeps shallow information but also allows the network to learn the weights of information at different depths by itself and re-fit the dehazing features. Comparative experiments on the RESIDE and Haze4K datasets verify the efficiency and high quality of our algorithm. A series of ablation experiments shows that the multi-weight attention feature fusion (WA) module and the Mix module effectively improve model performance. The code is released at https://doi.org/10.5281/zenodo.10836919.
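
For readers unfamiliar with the term, "transmission" in dehazing usually refers to the standard atmospheric scattering model, I = J * t + A * (1 - t), where I is the hazy image, J the clear scene, t the per-pixel transmission and A the global airlight. The abstract does not state this model, so the sketch below is background under that assumption and not the paper's network; the function names and the `t_min` clamp are illustrative.

```python
import numpy as np

def add_haze(clear, transmission, airlight):
    """Synthesize a hazy image with the atmospheric scattering model."""
    return clear * transmission + airlight * (1.0 - transmission)

def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the model; clamp t from below to avoid dividing by near-zero values."""
    t = np.maximum(transmission, t_min)
    return (hazy - airlight) / t + airlight

rng = np.random.default_rng(0)
clear = rng.random((4, 4))                          # stand-in clear scene
transmission = rng.uniform(0.3, 0.9, size=(4, 4))   # thicker fog means smaller t
hazy = add_haze(clear, transmission, airlight=0.8)
recovered = remove_haze(hazy, transmission, airlight=0.8)
```

With a known transmission map the inversion is exact. In practice the map has to be estimated, which is one reason a network might use it as guidance instead of inverting the model directly.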

Citations: 0

Semantic guidance incremental network for efficiency video super-resolution

The Visual Computer

Pub Date : 2024-07-02 DOI: 10.1007/s00371-024-03488-y

Xiaonan He, Yukun Xia, Yuansong Qiao, Brian Lee, Yuhang Ye

In video streaming, bandwidth constraints significantly affect client-side video quality. Addressing this, deep neural networks offer a promising avenue for implementing video super-resolution (VSR) at the user end, leveraging advancements in modern hardware, including mobile devices. The principal challenge in VSR is the computational intensity involved in processing temporal/spatial video data. Conventional methods, uniformly processing entire scenes, often result in inefficient resource allocation. This is evident in the over-processing of simpler regions and insufficient attention to complex regions, leading to edge artifacts in merged regions. Our innovative approach employs semantic segmentation and spatial frequency-based categorization to divide each video frame into regions of varying complexity: simple, medium, and complex. These are then processed through an efficient incremental model, optimizing computational resources. A key innovation is the sparse temporal/spatial feature transformation layer, which mitigates edge artifacts and ensures seamless integration of regional features, enhancing the naturalness of the super-resolution outcome. Experimental results demonstrate that our method significantly boosts VSR efficiency while maintaining effectiveness. This marks a notable advancement in streaming video technology, optimizing video quality with reduced computational demands. This approach, featuring semantic segmentation, spatial frequency analysis, and an incremental network structure, represents a substantial improvement over traditional VSR methodologies, addressing the core challenges of efficiency and quality in high-resolution video streaming.
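
The idea of sorting regions into simple, medium, and complex by spatial frequency can be sketched with a gradient-based score. This covers only the frequency side of the categorization (the abstract also uses semantic segmentation), and the score definition, the thresholds `low` and `high`, and the test patches are all assumptions made for illustration.

```python
import numpy as np

def complexity(patch):
    """Mean absolute horizontal plus vertical difference, a crude spatial-frequency proxy."""
    dx = np.abs(np.diff(patch, axis=1)).mean()
    dy = np.abs(np.diff(patch, axis=0)).mean()
    return dx + dy

def categorize(patch, low=0.05, high=0.3):
    """Map a patch to a complexity class; the thresholds are hypothetical."""
    score = complexity(patch)
    if score < low:
        return "simple"
    if score < high:
        return "medium"
    return "complex"

flat = np.full((8, 8), 0.5)                           # no variation
ramp = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))      # smooth horizontal gradient
checker = np.indices((8, 8)).sum(axis=0) % 2          # highest-frequency pattern
labels = [categorize(p) for p in (flat, ramp, checker)]
```

A router built on such labels could send "simple" patches through fewer layers than "complex" ones, which is the kind of resource allocation the abstract argues for.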

Citations: 0

Effective multi-scale enhancement fusion method for low-light images based on interest-area perception OCTM and “pixel healthiness” evaluation

The Visual Computer

Pub Date : 2024-07-02 DOI: 10.1007/s00371-024-03554-5

Yi-lun Wang, Yi-zheng Lang, Yun-sheng Qian

Low-light images suffer from low contrast and low dynamic range. However, most existing single-frame low-light image enhancement algorithms are not good enough in terms of detail preservation and color expression and often have high algorithmic complexity. In this paper, we propose a single-frame low-light image fusion enhancement algorithm based on multi-scale contrast–tone mapping and "pixel healthiness" evaluation. It can adaptively adjust the exposure level of each region according to the principal component in the image and enhance contrast while preserving color and detail expression with low computational complexity. In particular, to find the most appropriate size of the artificial image sequence and the target enhancement range for each image, we propose a multi-scale parameter determination method based on the principal component analysis of the V-channel histogram to obtain the best enhancement while reducing unnecessary computations. In addition, a new "pixel healthiness" evaluation method based on global illuminance and local contrast is proposed for fast and efficient computation of weights for image fusion. Subjective evaluation and objective metrics show that our algorithm performs better than existing single-frame image algorithms and other fusion-based algorithms in enhancement, contrast, color expression, and detail preservation.
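
The abstract does not define "pixel healthiness", so the following sketch substitutes a well-known stand-in from exposure fusion: each pixel in each artificially exposed image gets a weight from how well exposed it is and how much local contrast it has, and the weights are normalized across the sequence before blending. The Gaussian exposedness term, the Laplacian contrast term, the gamma-generated sequence, and all parameter values are assumptions, not the paper's formulation.

```python
import numpy as np

def well_exposedness(img, sigma=0.2):
    """Higher weight for pixel values near mid-gray (0.5)."""
    return np.exp(-((img - 0.5) ** 2) / (2 * sigma ** 2))

def local_contrast(img):
    """Absolute 4-neighbour Laplacian with edge-replicated borders."""
    p = np.pad(img, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * img
    return np.abs(lap)

def fuse(images, eps=1e-6):
    """Blend an exposure sequence with per-pixel normalized weights."""
    stack = np.stack(images)
    weights = np.stack(
        [well_exposedness(i) * (local_contrast(i) + eps) for i in images]
    )
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * stack).sum(axis=0), weights

rng = np.random.default_rng(0)
dark = rng.random((6, 6)) * 0.2                        # stand-in low-light V channel
exposures = [dark ** g for g in (1.0, 0.6, 0.4)]       # artificial brighter versions
fused, weights = fuse(exposures)
```

Because the weights are non-negative and sum to one at every pixel, the fused value always lies between the darkest and brightest version of that pixel.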

Citations: 0

Exploring high-quality image deraining Transformer via effective large kernel attention

The Visual Computer

Pub Date : 2024-07-02 DOI: 10.1007/s00371-024-03551-8

Haobo Dong, Tianyu Song, Xuanyu Qi, Jiyu Jin, Guiyue Jin, Lei Fan

In recent years, Transformers have demonstrated strong performance in single-image deraining tasks. However, the standard self-attention in the Transformer makes it difficult to model local image features effectively. To alleviate this problem, this paper proposes a high-quality deraining Transformer with effective large kernel attention, named ELKAformer. The network employs the Transformer-Style Effective Large Kernel Conv-Block (ELKB), which contains three key designs that guide the extraction of rich features: the Large Kernel Attention Block (LKAB), the Dynamical Enhancement Feed-forward Network (DEFN), and the Edge Squeeze Recovery Block (ESRB). Specifically, LKAB introduces convolutional modulation as a substitute for vanilla self-attention to achieve better local representations. The designed DEFN refines the most valuable attention values in LKAB, allowing the overall design to better preserve pixel-wise information. Additionally, we develop ESRB to obtain long-range dependencies of different positional information. Extensive experimental results demonstrate that this method achieves favorable effects while effectively saving computational costs. Our code is available at github
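
Convolutional modulation, mentioned above as the replacement for self-attention, is commonly understood as multiplying a value branch element-wise by an attention map produced by a large-kernel depthwise convolution. The NumPy sketch below follows that general reading; the 7x7 kernel size, the random weights, and the function names are assumptions, and the block does not reproduce LKAB's actual layout.

```python
import numpy as np

def depthwise_conv(x, kernel):
    """Per-channel 'same' cross-correlation with zero padding.

    x: (C, H, W) features, kernel: (C, k, k) with odd k.
    """
    c, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernel[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

def conv_modulation(x, kernel, w_value):
    """Attention from a large-kernel depthwise conv, applied multiplicatively."""
    attn = depthwise_conv(x, kernel)               # spatial context per channel
    value = np.einsum("oc,chw->ohw", w_value, x)   # 1x1 conv, i.e. channel mixing
    return attn * value

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))                 # two channels, 5x5 feature map
kernel = rng.standard_normal((2, 7, 7)) * 0.1      # hypothetical large (7x7) kernel
w_value = rng.standard_normal((2, 2))              # hypothetical 1x1 conv weights
y = conv_modulation(x, kernel, w_value)
```

Unlike self-attention, the cost of this operation grows linearly with the number of pixels, which fits the abstract's claim about saving computation.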

Citations: 0

Generative adversarial networks for handwriting image generation: a review

The Visual Computer

Pub Date : 2024-07-02 DOI: 10.1007/s00371-024-03534-9

Randa Elanwar, Margrit Betke

Handwriting synthesis, the task of automatically generating realistic images of handwritten text, has gained increasing attention in recent years, both as a challenge in itself and as a task that supports handwriting recognition research. In the latter role, it is used to synthesize large image datasets that can then train deep learning models to recognize handwritten text without the need for human-provided annotations. While early attempts at developing handwriting generators yielded limited results [1], more recent works involving generative models with deep neural network architectures have been shown to produce realistic imitations of human handwriting [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. In this review, we focus on one of the most prevalent and successful architectures in the field of handwriting synthesis, the generative adversarial network (GAN). We describe the capabilities, architecture specifics, and performance of the GAN-based models that have been introduced to the literature since 2019 [2,3,4,5,6,7,8,9,10,11,12,13,14]. These models can generate random handwriting styles, imitate reference styles, and produce realistic images of arbitrary text that was not in the training lexicon. The generated images have been shown to improve handwriting recognition results when used to augment the training samples of recognition models. The synthetic images were often hard to expose as non-real, even for human examiners, but they could also be implausible or style-limited. The review includes a discussion of the characteristics of the GAN architecture in comparison with other paradigms in the image-generation domain and highlights the remaining challenges for handwriting synthesis.



Citations: 0

Artistic style decomposition for texture and shape editing

The Visual Computer

Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03521-0

Max Reimann, Martin Büßemeyer, Benito Buchheim, Amir Semmo, Jürgen Döllner, Matthias Trapp

While methods for generative image synthesis and example-based stylization produce impressive results, their black-box style representation intertwines shape, texture, and color aspects, limiting precise stylistic control and editing of artistic images. We introduce a novel method for decomposing the style of an artistic image that enables interactive geometric shape abstraction and texture control. We spatially decompose the input image into geometric shapes and an overlaying parametric texture representation, facilitating independent manipulation of color and texture. The parameters in this texture representation, comprising the image’s high-frequency details, control painterly attributes in a series of differentiable stylization filters. Shape decomposition is achieved using either segmentation or stroke-based neural rendering techniques. We demonstrate that our shape and texture decoupling enables diverse stylistic edits, including adjustments in shape, stroke, and painterly attributes such as contours and surface relief. Moreover, we demonstrate shape and texture style transfer in the parametric space using both reference images and text prompts and accelerate these by training networks for single- and arbitrary-style parameter prediction.
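The idea of separating a shape layer from a detail layer can be illustrated with a deliberately simplified sketch. It is not the authors' method: it assumes a segmentation is already given as a list of labels, uses each segment's mean gray value as the "shape" layer, and treats the per-pixel residual as the "detail" layer. The function names `decompose` and `recompose` and the sample values are hypothetical.

```python
from collections import defaultdict


def decompose(pixels, labels):
    """Split gray values into per-segment means and a per-pixel residual."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(pixels, labels):
        sums[label] += value
        counts[label] += 1
    # Shape layer: one mean value per segment.
    means = {label: sums[label] / counts[label] for label in sums}
    # Detail layer: what is left after subtracting the segment mean.
    detail = [value - means[label] for value, label in zip(pixels, labels)]
    return means, detail


def recompose(means, labels, detail):
    """Rebuild the image from segment means plus the residual detail."""
    return [means[label] + d for label, d in zip(labels, detail)]


# Toy 1-D "image" with two segments (illustrative values only).
pixels = [10.0, 12.0, 14.0, 50.0, 54.0]
labels = [0, 0, 0, 1, 1]

means, detail = decompose(pixels, labels)

# Edit only the shape layer: brighten segment 0, keep its detail unchanged.
edited_means = dict(means)
edited_means[0] = 30.0
edited = recompose(edited_means, labels, detail)
print(edited)  # [28.0, 30.0, 32.0, 50.0, 54.0]
```

Because the two layers are stored separately, changing a segment's base value leaves its residual detail untouched, which is the kind of independent editing the abstract describes at a much higher level of sophistication.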



Citations: 0

LERFNet: an enlarged effective receptive field backbone network for enhancing visual drone detection

The Visual Computer

Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03527-8

Mohamed Elsayed, Mohamed Reda, Ahmed S. Mashaly, Ahmed S. Amein

Drone applications and missions have increased sharply in recent years. Drones that are operated illegally must be detected quickly, effectively, and precisely. Vision-based anti-drone systems perform efficiently compared to radar- and acoustic-based systems. The effectiveness of drone detection is affected by a number of issues, including the drone’s small size, conflicts with other objects, and noisy backgrounds. This paper enlarges the effective receptive field (ERF) of the feature maps generated by the YOLOv6 backbone. First, RepLKNet, which deploys large kernels with depth-wise convolution, is used as the backbone of YOLOv6. Then, to overcome RepLKNet’s long inference time, a novel backbone, LERFNet, is implemented. LERFNet combines dilated convolution with large kernels to enlarge the ERF, so that each technique offsets the other’s drawbacks. The linear spatial-channel attention module (LAM) is used to give more attention to the most informative pixels and high feature channels. LERFNet produces output feature maps with a large ERF and high shape bias to enhance the detection of various drone sizes in complex scenes. The RepLKNet and LERFNet backbones for Tiny-YOLOv6, Tiny-YOLOv6, YOLOv5s, and Tiny-YOLOv7 are compared. Compared with these techniques, the proposed model achieves a better balance between accuracy and speed. LERFNet increases the mAP by 2.8% while significantly reducing the GFLOPs and the number of parameters compared to the original YOLOv6 backbone.
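To make concrete why dilation enlarges the receptive field without adding weights, the sketch below computes the theoretical receptive field of a stack of convolution layers. The layer settings are hypothetical and are not taken from the paper. The theoretical receptive field is only an upper bound: the effective receptive field discussed in the abstract is an empirical quantity and is generally smaller.

```python
def effective_kernel_size(kernel_size, dilation=1):
    """Span covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers):
    """Theoretical receptive field of stacked conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input towards the output.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # distance, in input pixels, between adjacent outputs
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Three plain 3x3 layers versus three 3x3 layers with growing dilation.
plain = [(3, 1, 1), (3, 1, 1), (3, 1, 1)]
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]

print(effective_kernel_size(5, dilation=3))  # 13
print(receptive_field(plain))                # 7
print(receptive_field(dilated))              # 15
```

Here a 5x5 kernel with dilation 3 spans the same 13 pixels as a dense 13x13 kernel while using far fewer weights, which is one way to read the abstract's pairing of dilated convolution with large kernels.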



Citations: 0