When light is scattered or reflected unintentionally inside the lens, flare artifacts may appear in captured photographs and degrade their visual quality. The main challenge in flare removal is to eliminate diverse flare artifacts while preserving the original image content. To address this challenge, we propose a lightweight Multi-Frequency Deflare Network (MFDNet) based on the Laplacian pyramid. Our network decomposes the flare-corrupted image into low- and high-frequency bands, effectively separating the illumination and content information in the image: the low-frequency part typically contains illumination information, while the high-frequency part contains detailed content information. Accordingly, MFDNet consists of two main modules: the Low-Frequency Flare Perception Module (LFFPM), which removes flare in the low-frequency part, and the Hierarchical Fusion Reconstruction Module (HFRM), which reconstructs the flare-free image. Specifically, to perceive flare from a global perspective while retaining detailed information for image restoration, LFFPM uses a Transformer to extract global information and a convolutional neural network to capture detailed local features. HFRM then gradually fuses the outputs of LFFPM with the high-frequency components of the image through feature aggregation. Moreover, MFDNet reduces computational cost by processing multiple frequency bands instead of removing the flare directly on the input image. Experimental results demonstrate that our approach outperforms state-of-the-art methods in removing nighttime flare on real-world and synthetic images from the Flare7K dataset. Furthermore, the computational complexity of our model is low.
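As a rough illustration of the Laplacian pyramid decomposition the abstract relies on, the following Python sketch splits a small grayscale image into high-frequency bands plus a low-frequency residual and then reconstructs it. It is not MFDNet's implementation: it assumes a 2x2 box filter in place of the usual Gaussian kernel, nearest-neighbour upsampling, and image sides divisible by 2**levels, and all function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging each 2x2 block (box filter, an assumption)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def upsample(img):
    """Double the resolution by repeating every pixel (nearest neighbour)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def combine(a, b, sign):
    """Return a + sign * b, element by element."""
    return [[va + sign * vb for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


def build_pyramid(img, levels):
    """Return (high_bands, low_residual); high_bands[0] is the finest band."""
    highs, current = [], img
    for _ in range(levels):
        low = downsample(current)
        # The high band is what the blurred, upsampled copy fails to explain.
        highs.append(combine(current, upsample(low), -1))
        current = low
    return highs, current


def reconstruct(highs, low):
    """Invert build_pyramid by adding the bands back from coarse to fine."""
    current = low
    for high in reversed(highs):
        current = combine(upsample(current), high, +1)
    return current


image = [[float(4 * y + x + 1) for x in range(4)] for y in range(4)]
highs, low = build_pyramid(image, levels=2)
restored = reconstruct(highs, low)
```

In a pipeline of the kind described, a network would presumably operate on `low` (the much smaller illumination-dominated band) before the high bands are fused back in, which is where the computational saving would come from.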
Title: MFDNet: Multi-Frequency Deflare Network for efficient nighttime flare removal. Authors: Yiguo Jiang, Xuhang Chen, Chi-Man Pun, Shuqiang Wang, Wei Feng. Journal: The Visual Computer. Published: 2024-07-04. DOI: 10.1007/s00371-024-03540-x
Pub Date: 2024-07-03 DOI: 10.1007/s00371-024-03559-0
Wen-Kai Tsai, Hsin-Chih Wang
Boundary and connectivity priors are commonly used to detect salient objects in images. Methods based on them often face two problems: 1) if the salient object touches the image boundary, its saliency is not detected correctly, and 2) accurate pixel-wise or superpixel-wise computation is time-consuming. This study proposes a block-wise algorithm that reduces computation time and mitigates the failure that occurs when salient objects touch the image boundary. The algorithm consists of four stages. In the first stage, each block is analyzed by an adaptive micro and macro prediction technique to generate a saliency prediction map. The second stage selects background and salient sources from the saliency prediction map. Background sources are extracted from image-boundary regions with low saliency values. Salient sources are accurately positioned in the region of salient objects. In the third stage, the background and salient sources are used to generate the background path and salient path based on the minimum barrier distance. The block-wise initial saliency map is obtained by fusing the background and salient paths. In the fourth stage, major-color modeling and visual focus priors are used to refine the saliency map and reduce blocking artifacts. In experiments on three datasets, the proposed method produced the best results among the compared algorithms and ran at 284 frames per second (FPS) on the MSRA-10K dataset. Our method shows at least a 29.09% speed improvement and executes in real time on a lightweight embedded platform.
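The minimum barrier distance used in the third stage can be sketched as follows. The barrier cost of a path is the difference between the largest and smallest values along it, and the distance from a pixel to a seed set is the smallest such cost over all connecting paths. The Python sketch below uses a raster-scan approximation, which is a common way to compute this distance quickly but is an assumption here, since the abstract does not say how the authors compute it; the function name, the number of passes, and the pixel-level (rather than block-level) grid are likewise illustrative.

```python
def minimum_barrier_distance(image, seeds, passes=3):
    """Approximate the minimum barrier distance to the seed set by raster scanning."""
    h, w = len(image), len(image[0])
    inf = float("inf")
    dist = [[0.0 if seeds[y][x] else inf for x in range(w)] for y in range(h)]
    hi = [[float(v) for v in row] for row in image]  # maximum along the best path so far
    lo = [[float(v) for v in row] for row in image]  # minimum along the best path so far
    for p in range(passes):
        forward = p % 2 == 0  # alternate top-left and bottom-right sweeps
        ys = range(h) if forward else range(h - 1, -1, -1)
        xs = range(w) if forward else range(w - 1, -1, -1)
        neighbours = ((-1, 0), (0, -1)) if forward else ((1, 0), (0, 1))
        for y in ys:
            for x in xs:
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w) or dist[ny][nx] == inf:
                        continue
                    # Extend the neighbour's path by the current pixel.
                    new_hi = max(hi[ny][nx], image[y][x])
                    new_lo = min(lo[ny][nx], image[y][x])
                    if new_hi - new_lo < dist[y][x]:
                        dist[y][x] = new_hi - new_lo
                        hi[y][x], lo[y][x] = new_hi, new_lo
    return dist


# Toy example: a dark 5x5 image with one bright pixel; the border acts as background seeds.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
seeds = [[y in (0, 4) or x in (0, 4) for x in range(5)] for y in range(5)]
dist = minimum_barrier_distance(image, seeds)
```

Pixels that can reach the background seeds without crossing an intensity change get a distance of zero, while the bright pixel gets the full contrast as its distance, which is why this measure suits background-versus-salient path scoring.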
Title: Real-time salient object detection based on accuracy background and salient path source selection. Journal: The Visual Computer.
Pub Date: 2024-07-03 DOI: 10.1007/s00371-024-03546-5
Avantika Saklani, Shailendra Tiwari, H. S. Pannu
In contrast to single-modal content, multimodal data can offer more vivid and effective insight into food statistics. However, traditional food classification systems focus on a single modality. This is inadequate given the massive amount of data emerging daily, which has recently attracted researchers to this field. Moreover, very few multimodal Indian food datasets are available. Motivated by these observations, we build a novel multimodal food analysis model based on a deep attentive multimodal fusion network (DAMFN) for linguistic and visual integration. The model includes three stages: functional feature extraction, early-stage fusion, and feature classification. In functional feature extraction, deep features are extracted from the individual modalities. An early-stage fusion is then applied that leverages the deep correlation between the modalities. Lastly, in the feature classification phase, the fused features are passed to the classifier for the final decision. We further developed a dataset of Indian food images with their related captions for experimental purposes. In addition, the proposed approach is evaluated on a large-scale dataset, UPMC Food 101, which has 90,704 instances. The experimental results demonstrate that the proposed DAMFN outperforms several state-of-the-art multimodal food classification methods as well as single-modality systems.
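The three-stage flow (per-modality features, early-stage fusion, classification) can be illustrated with a deliberately minimal Python sketch. It assumes that fusion is plain concatenation of an image feature vector and a text feature vector, followed by a single linear layer with softmax; the attention mechanism of DAMFN is not described in the abstract and is not modelled here, and all names, feature values, and weights are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion_classify(image_feat, text_feat, weights, biases):
    """Concatenate the modality features first, then classify the joint vector."""
    fused = list(image_feat) + list(text_feat)  # early fusion: one joint representation
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


# Hypothetical features: 2 image dimensions, 3 text dimensions, 2 classes.
image_feat = [0.2, 0.9]
text_feat = [1.0, 0.0, 0.5]
weights = [[1.0, 0.0, 1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 1.0, 0.0]]
biases = [0.0, 0.0]
probs = early_fusion_classify(image_feat, text_feat, weights, biases)
```

Because the classifier sees the joint vector, a single weight row can combine evidence from both modalities, which is the property that separates early fusion from averaging two independent per-modality predictions.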
Title: Deep attentive multimodal learning for food information enhancement via early-stage heterogeneous fusion. Journal: The Visual Computer.
Image dehazing is an important low-level vision task, and its quality and efficiency directly affect the quality of high-level vision tasks. Therefore, how to quickly and efficiently process hazy images with different fog densities has become a focus of research. This paper presents a multi-feature fusion embedded image dehazing network based on transmission guidance. First, we propose a transmission graph-guided feature fusion enhanced coding network, which can combine different weight information and shows better flexibility for different dehazing information. In addition, to keep more detailed information in the reconstructed image, we propose a decoder network embedded with a Mix module, which not only keeps shallow information but also allows the network to learn the weights of information at different depths and re-fit the dehazing features. Comparative experiments on the RESIDE and Haze4K datasets verify the efficiency and high quality of our algorithm. A series of ablation experiments shows that the multi-weight attention feature fusion (WA) module and the Mix module effectively improve model performance. The code is released at https://doi.org/10.5281/zenodo.10836919.
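For readers unfamiliar with the role of transmission in dehazing, the following Python sketch shows the atmospheric scattering model that is widely used in the dehazing literature, I = J * t + A * (1 - t), and its inversion. The abstract does not state that the network uses this exact formulation, so treat it as background under that assumption; the function names, the lower bound `t_min`, and the sample values are illustrative.

```python
def add_haze(clear, transmission, airlight):
    """Synthesize hazy intensities: I = J * t + A * (1 - t)."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]


def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A.

    Clamping t from below avoids amplifying noise where the haze is very dense.
    """
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(hazy, transmission)]


clear = [0.2, 0.5, 0.8]          # scene radiance J
transmission = [0.9, 0.6, 0.3]   # fraction of light that reaches the camera
hazy = add_haze(clear, transmission, airlight=1.0)
recovered = remove_haze(hazy, transmission, airlight=1.0)
```

The lower the transmission, the more the pixel is pulled toward the airlight value, which is why a transmission estimate is a natural guidance signal for weighting features by haze density.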
Title: Transmission-guided multi-feature fusion Dehaze network. Authors: Xiaoyang Zhao, Zhuo Wang, Zhongchao Deng, Hongde Qin, Zhongben Zhu. Journal: The Visual Computer. Published: 2024-07-03. DOI: 10.1007/s00371-024-03533-w
Pub Date: 2024-07-02 DOI: 10.1007/s00371-024-03488-y
Xiaonan He, Yukun Xia, Yuansong Qiao, Brian Lee, Yuhang Ye
In video streaming, bandwidth constraints significantly affect client-side video quality. Addressing this, deep neural networks offer a promising avenue for implementing video super-resolution (VSR) at the user end, leveraging advancements in modern hardware, including mobile devices. The principal challenge in VSR is the computational intensity involved in processing temporal/spatial video data. Conventional methods, uniformly processing entire scenes, often result in inefficient resource allocation. This is evident in the over-processing of simpler regions and insufficient attention to complex regions, leading to edge artifacts in merged regions. Our innovative approach employs semantic segmentation and spatial frequency-based categorization to divide each video frame into regions of varying complexity: simple, medium, and complex. These are then processed through an efficient incremental model, optimizing computational resources. A key innovation is the sparse temporal/spatial feature transformation layer, which mitigates edge artifacts and ensures seamless integration of regional features, enhancing the naturalness of the super-resolution outcome. Experimental results demonstrate that our method significantly boosts VSR efficiency while maintaining effectiveness. This marks a notable advancement in streaming video technology, optimizing video quality with reduced computational demands. This approach, featuring semantic segmentation, spatial frequency analysis, and an incremental network structure, represents a substantial improvement over traditional VSR methodologies, addressing the core challenges of efficiency and quality in high-resolution video streaming.
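The spatial-frequency part of the region categorization can be sketched in Python as below. The sketch uses the classic spatial frequency measure (root of the mean squared horizontal and vertical pixel differences) and two thresholds to label a patch as simple, medium, or complex. The thresholds, patch size, and function names are assumptions chosen for illustration, and the semantic segmentation step mentioned in the abstract is not modelled.

```python
import math


def spatial_frequency(patch):
    """Combine row and column frequency: sqrt(mean squared neighbour differences)."""
    h, w = len(patch), len(patch[0])
    row_sq = sum((patch[y][x] - patch[y][x - 1]) ** 2
                 for y in range(h) for x in range(1, w))
    col_sq = sum((patch[y][x] - patch[y - 1][x]) ** 2
                 for y in range(1, h) for x in range(w))
    return math.sqrt((row_sq + col_sq) / (h * w))


def classify_patch(patch, low_thr=5.0, high_thr=20.0):
    """Map a patch to a complexity class; both thresholds are hypothetical."""
    sf = spatial_frequency(patch)
    if sf < low_thr:
        return "simple"
    if sf < high_thr:
        return "medium"
    return "complex"


flat = [[128.0] * 4 for _ in range(4)]                                # uniform region
ramp = [[10.0 * x for x in range(4)] for _ in range(4)]               # smooth gradient
checker = [[255.0 * ((x + y) % 2) for x in range(4)] for y in range(4)]  # dense texture
labels = [classify_patch(p) for p in (flat, ramp, checker)]
```

A dispatcher could then route "simple" patches to a shallow branch and "complex" patches to a deeper one, which is the kind of resource allocation the abstract argues for.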
Title: Semantic guidance incremental network for efficiency video super-resolution. Journal: The Visual Computer.
Pub Date: 2024-07-02 DOI: 10.1007/s00371-024-03554-5
Yi-lun Wang, Yi-zheng Lang, Yun-sheng Qian
Low-light images suffer from low contrast and low dynamic range. However, most existing single-frame low-light image enhancement algorithms are not good enough in terms of detail preservation and color expression and often have high algorithmic complexity. In this paper, we propose a single-frame low-light image fusion enhancement algorithm based on multi-scale contrast–tone mapping and "pixel healthiness" evaluation. It can adaptively adjust the exposure level of each region according to the principal component in the image and enhance contrast while preserving color and detail expression with low computational complexity. In particular, to find the most appropriate size of the artificial image sequence and the target enhancement range for each image, we propose a multi-scale parameter determination method based on the principal component analysis of the V-channel histogram to obtain the best enhancement while reducing unnecessary computations. In addition, a new "pixel healthiness" evaluation method based on global illuminance and local contrast is proposed for fast and efficient computation of weights for image fusion. Subjective evaluation and objective metrics show that our algorithm performs better than existing single-frame image algorithms and other fusion-based algorithms in enhancement, contrast, color expression, and detail preservation.
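The general idea of fusing an artificial image sequence with per-pixel weights can be sketched in Python as follows. This is not the paper's method: the artificial exposures are produced by simple gamma curves, and the weight is the well-exposedness term from classic exposure fusion (a Gaussian centred on mid-gray), used here as a stand-in for the "pixel healthiness" score, whose definition the abstract does not give. The gamma values, `sigma`, and function names are assumptions.

```python
import math


def well_exposedness(v, sigma=0.2):
    """Weight that peaks at mid-gray (0.5); a stand-in for a pixel quality score."""
    return math.exp(-((v - 0.5) ** 2) / (2.0 * sigma ** 2))


def fuse_exposures(v_channel, gammas=(1.0, 0.5, 0.25)):
    """Brighten each V value with several gamma curves and blend them per pixel."""
    fused = []
    for v in v_channel:
        exposures = [v ** g for g in gammas]            # artificial image sequence
        weights = [well_exposedness(e) for e in exposures]
        total = sum(weights)
        # Normalized weighted average, so the result stays within the exposure range.
        fused.append(sum(w * e for w, e in zip(weights, exposures)) / total)
    return fused


dark = [0.05, 0.10, 0.30]   # V-channel values in [0, 1] from a low-light image
enhanced = fuse_exposures(dark)
```

Normalizing the weights per pixel keeps every fused value between the darkest and brightest candidate for that pixel, so the blend cannot overshoot the valid range.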
Title: Effective multi-scale enhancement fusion method for low-light images based on interest-area perception OCTM and “pixel healthiness” evaluation. Journal: The Visual Computer.
Pub Date: 2024-07-02 DOI: 10.1007/s00371-024-03551-8
Haobo Dong, Tianyu Song, Xuanyu Qi, Jiyu Jin, Guiyue Jin, Lei Fan
In recent years, Transformers have demonstrated strong performance in single-image deraining tasks. However, the standard self-attention in the Transformer makes it difficult to model local image features effectively. To alleviate this problem, this paper proposes a high-quality deraining Transformer with effective large kernel attention, named ELKAformer. The network employs the Transformer-Style Effective Large Kernel Conv-Block (ELKB), which contains three key designs to guide the extraction of rich features: the Large Kernel Attention Block (LKAB), the Dynamical Enhancement Feed-forward Network (DEFN), and the Edge Squeeze Recovery Block (ESRB). Specifically, LKAB introduces convolutional modulation as a substitute for vanilla self-attention to achieve better local representations. The DEFN refines the most valuable attention values in LKAB, allowing the overall design to better preserve pixel-wise information. Additionally, we develop ESRB to capture long-range dependencies across different positions. Extensive experiments demonstrate that this method achieves favorable results while effectively reducing computational cost. Our code is available at github
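Convolutional modulation, the idea behind replacing self-attention with a large-kernel convolution, can be reduced to a one-dimensional Python sketch: a wide convolution over the input produces an attention map, and the input is multiplied element-wise by that map. The real LKAB operates on 2D feature maps and its exact layer layout is not given in the abstract, so the single-channel signal, the 5-tap averaging kernel, and the function names here are all assumptions.

```python
def conv1d_same(signal, kernel):
    """Zero-padded sliding-window product (cross-correlation) with same-length output."""
    k, n = len(kernel), len(signal)
    pad = k // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(k):
            idx = i + j - pad
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += kernel[j] * signal[idx]
        out.append(acc)
    return out


def large_kernel_modulation(signal, kernel):
    """Use the wide-kernel response as an attention map and modulate the input with it."""
    attention = conv1d_same(signal, kernel)
    return [a * s for a, s in zip(attention, signal)]


signal = [2.0] * 9
kernel = [0.2] * 5   # hypothetical "large" kernel; learned in a real network
modulated = large_kernel_modulation(signal, kernel)
```

Unlike self-attention, the cost of this operation grows linearly with the number of positions for a fixed kernel size, which is consistent with the efficiency argument made in the abstract.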
Title: Exploring high-quality image deraining Transformer via effective large kernel attention. Journal: The Visual Computer.
Pub Date: 2024-07-02 DOI: 10.1007/s00371-024-03534-9
Randa Elanwar, Margrit Betke
Handwriting synthesis, the task of automatically generating realistic images of handwritten text, has gained increasing attention in recent years, both as a challenge in itself, as well as a task that supports handwriting recognition research. The latter task is to synthesize large image datasets that can then be used to train deep learning models to recognize handwritten text without the need for human-provided annotations. While early attempts at developing handwriting generators yielded limited results [1], more recent works involving generative models of deep neural network architectures have been shown able to produce realistic imitations of human handwriting [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. In this review, we focus on one of the most prevalent and successful architectures in the field of handwriting synthesis, the generative adversarial network (GAN). We describe the capabilities, architecture specifics, and performance of the GAN-based models that have been introduced to the literature since 2019 [2,3,4,5,6,7,8,9,10,11,12,13,14]. These models can generate random handwriting styles, imitate reference styles, and produce realistic images of arbitrary text that was not in the training lexicon. The generated images have been shown to contribute to improving handwriting recognition results when augmenting the training samples of recognition models with synthetic images. The synthetic images were often hard to expose as non-real, even by human examiners, but also could be implausible or style-limited. The review includes a discussion of the characteristics of the GAN architecture in comparison with other paradigms in the image-generation domain and highlights the remaining challenges for handwriting synthesis.
Title: Generative adversarial networks for handwriting image generation: a review. Journal: The Visual Computer.
Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03521-0
Max Reimann, Martin Büßemeyer, Benito Buchheim, Amir Semmo, Jürgen Döllner, Matthias Trapp
While methods for generative image synthesis and example-based stylization produce impressive results, their black-box style representation intertwines shape, texture, and color aspects, limiting precise stylistic control and editing of artistic images. We introduce a novel method for decomposing the style of an artistic image that enables interactive geometric shape abstraction and texture control. We spatially decompose the input image into geometric shapes and an overlaying parametric texture representation, facilitating independent manipulation of color and texture. The parameters in this texture representation, comprising the image’s high-frequency details, control painterly attributes in a series of differentiable stylization filters. Shape decomposition is achieved using either segmentation or stroke-based neural rendering techniques. We demonstrate that our shape and texture decoupling enables diverse stylistic edits, including adjustments in shape, stroke, and painterly attributes such as contours and surface relief. Moreover, we demonstrate shape and texture style transfer in the parametric space using both reference images and text prompts and accelerate these by training networks for single- and arbitrary-style parameter prediction.
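The idea of splitting an image into a coarse layer and an overlaying high-frequency detail layer that can be edited independently can be illustrated with a toy sketch. This is not the authors' method: the abstract states that shapes come from segmentation or stroke-based neural rendering and that texture is controlled through differentiable stylization filters. Here a simple box blur stands in for the shape layer, and the names `box_blur`, `decompose`, `recompose`, and `detail_gain` are hypothetical.

```python
def box_blur(img, radius=1):
    """Average each pixel with its neighbours (window clipped at the borders)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def decompose(img, radius=1):
    """Split a grayscale image into a coarse base layer and a detail residual.

    Assumption: the blur is only a stand-in for a real shape abstraction.
    """
    base = box_blur(img, radius)
    detail = [[p - b for p, b in zip(row, base_row)]
              for row, base_row in zip(img, base)]
    return base, detail


def recompose(base, detail, detail_gain=1.0):
    """Rebuild the image; detail_gain scales the high-frequency layer only."""
    return [[b + detail_gain * d for b, d in zip(base_row, detail_row)]
            for base_row, detail_row in zip(base, detail)]


image = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
base, detail = decompose(image)
flattened = recompose(base, detail, detail_gain=0.0)   # details removed
emphasized = recompose(base, detail, detail_gain=2.0)  # details exaggerated
```

With `detail_gain=1.0` the input is recovered up to floating-point rounding, so edits to either layer stay separable from the other.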
{"title":"Artistic style decomposition for texture and shape editing","authors":"Max Reimann, Martin Büßemeyer, Benito Buchheim, Amir Semmo, Jürgen Döllner, Matthias Trapp","doi":"10.1007/s00371-024-03521-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03521-0","url":null,"PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141529901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-01 DOI: 10.1007/s00371-024-03527-8
Mohamed Elsayed, Mohamed Reda, Ahmed S. Mashaly, Ahmed S. Amein
Recently, the world has witnessed a great increase in drone applications and missions. Drones must be detected quickly, effectively, and precisely when they are being handled illegally. Vision-based anti-drone systems perform efficiently compared to radar- and acoustic-based systems. The effectiveness of drone detection is affected by a number of issues, including the drone’s small size, conflicts with other objects, and noisy backgrounds. This paper enlarges the effective receptive field (ERF) of the feature maps generated by the YOLOv6 backbone. First, RepLKNet, which deploys large kernels with depth-wise convolution, is used as the backbone of YOLOv6. Then, to overcome RepLKNet’s long inference time, a novel backbone, LERFNet, is implemented. LERFNet uses dilated convolution in addition to large kernels to enlarge the ERF, with each technique compensating for the other’s drawbacks. The linear spatial-channel attention module (LAM) is used to give more attention to the most informative pixels and high feature channels. LERFNet produces output feature maps with a large ERF and high shape bias to enhance the detection of various drone sizes in complex scenes. The RepLKNet and LERFNet backbones for Tiny-YOLOv6, Tiny-YOLOv6, YOLOv5s, and Tiny-YOLOv7 are compared. Compared with these techniques, the proposed model achieves a better balance between accuracy and speed. LERFNet increases the mAP by 2.8% while significantly reducing the GFLOPs and the number of parameters compared to the original backbone of YOLOv6.
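The trade-off between large kernels and dilated convolution can be made concrete with the standard receptive-field arithmetic. The sketch below computes the theoretical receptive field of a stack of convolutions, which only bounds the effective receptive field discussed in the abstract and is not a measurement of it. The layer configurations are hypothetical examples, not LERFNet's actual architecture, and the function names are illustrative.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel along one axis: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers) -> int:
    """Theoretical receptive field (one axis) of stacked convolutions.

    layers: iterable of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


def depthwise_weights_per_channel(kernel_size: int) -> int:
    """A depth-wise k x k kernel has k * k weights per channel, whatever the dilation."""
    return kernel_size * kernel_size


# Hypothetical configurations, all with stride 1:
three_small = [(3, 1, 1)] * 3                   # three plain 3x3 layers
one_large = [(13, 1, 1)]                        # one 13x13 large kernel
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]     # 3x3 layers with growing dilation

print(receptive_field(three_small))  # 7
print(receptive_field(one_large))    # 13
print(receptive_field(dilated))      # 15

# A 5x5 kernel with dilation 3 spans the same 13 pixels as a dense 13x13 kernel
# while using 25 instead of 169 weights per channel.
print(effective_kernel_size(5, 3), depthwise_weights_per_channel(5),
      depthwise_weights_per_channel(13))  # 13 25 169
```

Dilation widens the span cheaply but samples it sparsely, whereas a dense large kernel covers every position at a higher cost, which is one way to read the abstract's remark that the two techniques offset each other's problems.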
{"title":"LERFNet: an enlarged effective receptive field backbone network for enhancing visual drone detection","authors":"Mohamed Elsayed, Mohamed Reda, Ahmed S. Mashaly, Ahmed S. Amein","doi":"10.1007/s00371-024-03527-8","DOIUrl":"https://doi.org/10.1007/s00371-024-03527-8","url":null,"PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141529899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}