Accurate detection of people in outdoor scenes plays an essential role in improving personal safety and security. However, existing human detection algorithms face significant challenges when visibility is reduced and human appearance is degraded, particularly in hazy weather conditions. To address this problem, we present a novel lightweight model based on the RetinaNet detection architecture. The model incorporates a lightweight backbone feature extractor, dehazing functionality based on knowledge distillation (KD), and a multi-scale attention mechanism based on the Squeeze-and-Excitation (SE) principle. KD is performed from a larger network trained on clear, haze-free images, whereas attention is incorporated at both low-level and high-level features of the network. Experimental results show remarkable performance, outperforming state-of-the-art methods while running at 22 FPS. The combination of high accuracy and real-time capability makes our approach a promising solution for effective human detection in challenging weather conditions and suitable for real-time applications.
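To make the Squeeze-and-Excitation principle concrete, here is a minimal NumPy sketch of an SE channel-attention step; the shapes, random weights, and reduction ratio are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def se_attention(features, w1, w2):
    """Squeeze-and-Excitation channel attention (illustrative sketch).

    features: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r)
    are the bottleneck weights of the excitation MLP.
    """
    # Squeeze: global average pooling per channel -> (C,)
    z = features.mean(axis=(1, 2))
    # Excitation: bottleneck MLP with ReLU, then sigmoid gating in (0, 1)
    h = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    # Recalibrate: rescale each channel by its learned gate
    return features * s[:, None, None]

rng = np.random.default_rng(0)
c, r = 8, 4
feats = rng.standard_normal((c, 6, 6))
w1 = rng.standard_normal((c // r, c))
w2 = rng.standard_normal((c, c // r))
out = se_attention(feats, w1, w2)
```

Because the gates lie in (0, 1), the output is a per-channel attenuation of the input; in the paper this mechanism is applied at multiple scales of the detector.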
"SES-ReNet: Lightweight deep learning model for human detection in hazy weather conditions" — Yassine Bouafia, Mohand Saïd Allili, Loucif Hebbache, Larbi Guezouli. Signal Processing: Image Communication, vol. 130, Article 117223. DOI: 10.1016/j.image.2024.117223. Published 2024-10-30.
Pub Date: 2024-10-29 | DOI: 10.1016/j.image.2024.117224
Dongzhou Gu, Kaihua Huang, Shiwei Ma, Jiang Liu
Effective detection of Human-Object Interaction (HOI) is important for machine understanding of real-world scenarios. Image-based HOI detection has been investigated extensively, and recent one-stage methods strike a balance between accuracy and efficiency. However, it is difficult to predict temporally aware interaction actions from static images, since they introduce only limited temporal context. Meanwhile, owing to the lack of early large-scale video HOI datasets and the high computational cost of training spatial-temporal HOI models, recent exploratory studies mostly follow a two-stage paradigm, in which independent object detection and interaction recognition still suffer from computational redundancy and separate optimization. Therefore, inspired by the one-stage interaction point detection framework, a one-stage spatial-temporal HOI detection baseline is proposed in this paper, in which short-term local motion features and long-term temporal context features are obtained by the proposed temporal differential excitation module (TDEM) and a DLA-TSM backbone. Complementary visual features between multiple clips are then extracted by multi-feature fusion and fed into parallel detection branches. Finally, a video dataset of reduced size containing only actions (HOI-V) is constructed to motivate further research on end-to-end video HOI detection. Extensive experiments are also conducted to verify the validity of the proposed baseline.
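The core intuition behind temporal differential excitation, extracting short-term motion from frame-to-frame differences and using it to re-weight features, can be sketched as follows; this is a simplified stand-in, not the paper's TDEM:

```python
import numpy as np

def temporal_diff_excitation(clip):
    """Sketch of a temporal-difference motion cue (TDEM-inspired, simplified).

    clip: (T, C, H, W) per-frame features. Frames 1..T-1 are re-weighted
    by a gate derived from their motion energy relative to the prior frame.
    """
    # Short-term motion: difference between consecutive frames
    diff = np.diff(clip, axis=0)                 # (T-1, C, H, W)
    # Channel-wise motion energy, pooled over space
    energy = np.abs(diff).mean(axis=(2, 3))      # (T-1, C)
    gate = 1.0 / (1.0 + np.exp(-energy))         # sigmoid gate
    excited = clip.copy()
    excited[1:] *= gate[:, :, None, None]        # excite moving channels
    return excited

rng = np.random.default_rng(1)
clip = rng.standard_normal((4, 3, 5, 5))
out = temporal_diff_excitation(clip)
```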
"HOI-V: One-stage human-object interaction detection based on multi-feature fusion in videos" — Dongzhou Gu, Kaihua Huang, Shiwei Ma, Jiang Liu. Signal Processing: Image Communication, vol. 130, Article 117224. DOI: 10.1016/j.image.2024.117224.
Recent learning-based neural image compression methods have achieved impressive rate–distortion (RD) performance via sophisticated context entropy models, which capture the spatial correlations of latent features well. However, due to their dependency on adjacent or distant decoded features, existing methods require an inefficient serial processing structure, which significantly limits their practicality. Instead of pursuing computationally expensive entropy estimation, we propose to reduce spatial redundancy via channel-wise scale adaptive latent representation learning, whose entropy coding is spatially context-free and parallelizable. Specifically, the proposed encoder adaptively determines the scale of the latent features via a learnable binary mask, which is optimized with the RD cost. In this way, a lower-scale latent representation is allocated to channels with higher spatial redundancy, which consumes fewer bits, and vice versa. The downscaled latent features can be well recovered with a lightweight inter-channel upconversion module in the decoder. To compensate for the entropy estimation performance degradation, we further develop an inter-scale hyperprior entropy model, which supports highly efficient parallel encoding/decoding within each scale of the latent features. Extensive experiments illustrate the efficacy of the proposed method. Our method achieves bitrate savings of 18.23%, 19.36%, and 27.04% over HEVC Intra, along with decoding speeds that are 46, 48, and 51 times faster than the baseline method on the Kodak, Tecnick, and CLIC datasets, respectively.
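A toy sketch of the channel-wise scale idea: channels flagged as spatially redundant by a binary mask are stored at half resolution and upsampled at the decoder. The mask here is fixed by hand and the up-conversion is plain nearest-neighbour, both illustrative assumptions in place of the paper's learned components:

```python
import numpy as np

def downscale2x(ch):
    """Average-pool an (H, W) map by a factor of 2."""
    h, w = ch.shape
    return ch.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upscale2x(ch):
    """Nearest-neighbour 2x upsampling (stand-in for the learned module)."""
    return ch.repeat(2, axis=0).repeat(2, axis=1)

def scale_adaptive_codec(latent, mask):
    """Channels with mask == 1 are coded at half resolution.

    latent: (C, H, W); mask: (C,) of {0, 1}.
    Returns (reconstruction, number of coded coefficients).
    """
    rec, coeffs = np.empty_like(latent), 0
    for c in range(latent.shape[0]):
        if mask[c]:
            small = downscale2x(latent[c])
            rec[c] = upscale2x(small)
            coeffs += small.size          # quarter the coefficients
        else:
            rec[c] = latent[c]
            coeffs += latent[c].size
    return rec, coeffs

rng = np.random.default_rng(2)
latent = rng.standard_normal((4, 8, 8))
mask = np.array([1, 0, 1, 0])
rec, coeffs = scale_adaptive_codec(latent, mask)
```

With two of four channels downscaled, 160 coefficients are coded instead of 256; unmasked channels are reconstructed losslessly.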
"High efficiency deep image compression via channel-wise scale adaptive latent representation learning" — Chenhao Wu, Qingbo Wu, King Ngi Ngan, Hongliang Li, Fanman Meng, Linfeng Xu. Signal Processing: Image Communication, vol. 130, Article 117227. DOI: 10.1016/j.image.2024.117227. Published 2024-10-28.
Pub Date: 2024-10-28 | DOI: 10.1016/j.image.2024.117222
Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach
Extremely low-light text images pose significant challenges for scene text detection. Existing methods enhance these images using low-light image enhancement techniques before text detection. However, they fail to address the importance of low-level features, which are essential for optimal performance in downstream scene text tasks. Further research is also limited by the scarcity of extremely low-light text datasets. To address these limitations, we propose a novel, text-aware extremely low-light image enhancement framework. Our approach first integrates a Text-Aware Copy-Paste (Text-CP) augmentation method as a preprocessing step, followed by a dual-encoder–decoder architecture enhanced with Edge-Aware attention modules. We also introduce text detection and edge reconstruction losses to train the model to generate images with higher text visibility. Additionally, we propose a Supervised Deep Curve Estimation (Supervised-DCE) model for synthesizing extremely low-light images, allowing training on publicly available scene text datasets such as IC15. To further advance this domain, we annotated texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets. The proposed framework is rigorously tested against various traditional and deep learning-based methods on the newly labeled SID-Sony-Text, SID-Fuji-Text, LOL-Text, and synthetic extremely low-light IC15 datasets. Our extensive experiments demonstrate notable improvements in both image enhancement and scene text tasks, showcasing the model’s efficacy in text detection under extremely low-light conditions. Code and datasets will be released publicly at https://github.com/chunchet-ng/Text-in-the-Dark.
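The paper's Supervised-DCE model for synthesizing extremely low-light images is learned; as a rough illustration of the underlying idea, the quadratic curve x + αx(1−x) with α < 0, applied iteratively and followed by sensor-like noise, darkens a well-lit image. All constants below are illustrative assumptions:

```python
import numpy as np

def synthesize_low_light(img, alpha=-0.8, iterations=4, noise_std=0.01, seed=0):
    """Toy low-light synthesis: iterated darkening curve plus Gaussian noise.

    img: float array in [0, 1]. Each pass applies x + alpha * x * (1 - x),
    which darkens mid-tones for alpha < 0 while fixing 0 and 1.
    """
    out = img.astype(float)
    for _ in range(iterations):
        out = out + alpha * out * (1.0 - out)
    rng = np.random.default_rng(seed)
    out = out + rng.normal(0.0, noise_std, out.shape)  # sensor-like noise
    return np.clip(out, 0.0, 1.0)

img = np.full((4, 4), 0.6)   # a uniformly mid-bright patch
dark = synthesize_low_light(img)
```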
"Text in the dark: Extremely low-light text image enhancement" — Che-Tsung Lin, Chun Chet Ng, Zhi Qin Tan, Wan Jun Nah, Xinyu Wang, Jie Long Kew, Pohao Hsu, Shang Hong Lai, Chee Seng Chan, Christopher Zach. Signal Processing: Image Communication, vol. 130, Article 117222. DOI: 10.1016/j.image.2024.117222.
Pub Date: 2024-10-26 | DOI: 10.1016/j.image.2024.117226
Hanyang Wan, Ruoyun Liu, Li Yu
Scene text detection and recognition currently stand as prominent research areas in computer vision, boasting a broad spectrum of potential applications in fields such as intelligent driving and automated production. Existing mainstream methodologies, however, suffer from notable deficiencies including incomplete text region detection, excessive background noise, and a neglect of simultaneous global information and contextual dependencies. In this study, we introduce BMINet, an innovative scene text detection approach based on boundary fitting, paired with a double-supervised scene text recognition method that incorporates text region correction. The BMINet framework is primarily structured around a boundary fitting module and a multi-scale fusion module. The boundary fitting module samples a specific number of control points equidistantly along the predicted boundary and adjusts their positions to better align the detection box with the text shape. The multi-scale fusion module integrates information from multi-scale feature maps to expand the network’s receptive field. The double-supervised scene text recognition method, incorporating text region correction, integrates the image processing modules for rotating rectangle boxes and binary image segmentation. Additionally, it introduces a correction network to refine text region boundaries. This method integrates recognition techniques based on CTC loss and attention mechanisms, emphasizing texture details and contextual dependencies in text images to enhance network performance through dual supervision. Extensive ablation and comparison experiments confirm the efficacy of the two-stage model in achieving robust detection and recognition outcomes, achieving a recognition accuracy of 80.6% on the Total-Text dataset.
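Sampling a fixed number of control points equidistantly (by arc length) along a predicted boundary, as the boundary fitting module does, can be sketched as follows; the polygon and point count are illustrative, not taken from BMINet:

```python
import numpy as np

def sample_boundary_points(polygon, n_points=8):
    """Sample n_points equidistantly along a closed polygon boundary.

    polygon: (K, 2) vertex array. Returns (n_points, 2) control points
    spaced uniformly in arc length, starting at the first vertex.
    """
    pts = np.asarray(polygon, dtype=float)
    closed = np.vstack([pts, pts[:1]])                 # close the loop
    seg = np.diff(closed, axis=0)
    seg_len = np.hypot(seg[:, 0], seg[:, 1])
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])  # cumulative arc length
    targets = np.linspace(0.0, cum[-1], n_points, endpoint=False)
    out = np.empty((n_points, 2))
    for i, t in enumerate(targets):
        k = np.searchsorted(cum, t, side="right") - 1  # segment containing t
        frac = (t - cum[k]) / seg_len[k] if seg_len[k] else 0.0
        out[i] = closed[k] + frac * seg[k]             # linear interpolation
    return out

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
ctrl = sample_boundary_points(square, n_points=8)
```

For this square of perimeter 16, the 8 control points are spaced 2 units apart along the boundary; in the detector, such points are then shifted to better fit the text shape.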
"Double supervision for scene text detection and recognition based on BMINet" — Hanyang Wan, Ruoyun Liu, Li Yu. Signal Processing: Image Communication, vol. 130, Article 117226. DOI: 10.1016/j.image.2024.117226.
Pub Date: 2024-10-26 | DOI: 10.1016/j.image.2024.117229
Hegui Zhu, Luyang Wang, Zhan Gao, Yuelin Liu, Qian Zhao
Low-light image enhancement is a very challenging subject in computer vision applications such as visual surveillance, driving behavior analysis, and medical imaging. Low-light images suffer from numerous degradation problems such as accumulated noise, artifacts, and color distortion. Therefore, how to resolve these degradations and obtain clear images with high visual quality has become an important issue, as it can effectively improve the performance of high-level computer vision tasks. In this study, we propose a new two-stage low-light enhancement network with a progressive attention fusion strategy; the two hallmarks of this method are global feature fusion (GFF) and local detail restoration (LDR), which enrich the global content of the image and restore local details. Experimental results on the LOL dataset show that the proposed model achieves good enhancement effects. Moreover, on benchmark datasets without reference images, the proposed model also obtains a better NIQE score, outperforming most existing state-of-the-art methods in both quantitative and qualitative evaluations. All these results verify the effectiveness and superiority of the proposed method.
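The general mechanics of attention-weighted fusion of a global-content branch and a local-detail branch can be illustrated with a per-pixel softmax over the two branches; the paper's GFF/LDR modules are learned, so the magnitude-based scores here are purely an illustrative stand-in:

```python
import numpy as np

def attention_fuse(global_feat, local_feat):
    """Fuse two branches with a per-pixel two-way softmax attention map.

    Plain magnitudes stand in for learned attention logits; the output is
    a convex combination of the two branches at every pixel.
    """
    a, b = np.abs(global_feat), np.abs(local_feat)
    wa = np.exp(a) / (np.exp(a) + np.exp(b))   # weight of the global branch
    return wa * global_feat + (1.0 - wa) * local_feat

rng = np.random.default_rng(3)
g = rng.standard_normal((5, 5))   # global-content branch output
l = rng.standard_normal((5, 5))   # local-detail branch output
fused = attention_fuse(g, l)
```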
"A new two-stage low-light enhancement network with progressive attention fusion strategy" — Hegui Zhu, Luyang Wang, Zhan Gao, Yuelin Liu, Qian Zhao. Signal Processing: Image Communication, vol. 130, Article 117229. DOI: 10.1016/j.image.2024.117229.
Pub Date: 2024-10-22 | DOI: 10.1016/j.image.2024.117228
Yueying Luo, Kangjian He, Dan Xu, Hongzhen Shi, Wenxia Yin
Effectively fusing infrared and visible images enhances the visibility of infrared target information while capturing visual details. Adequately balancing the brightness and contrast of the fused image has posed a significant challenge. Moreover, preserving detailed information in fused images has been problematic. To address these issues, this paper proposes a fusion algorithm based on hybrid multi-scale decomposition and adaptive contrast enhancement. Initially, we present a hybrid multi-scale decomposition method aimed at comprehensively extracting valuable information from the source images. Subsequently, we advance an adaptive base-layer optimization approach to regulate the brightness and contrast of the resulting fused image. Lastly, we design a weight mapping rule grounded in saliency detection to integrate the small-scale layers, thereby conserving the edge structure in the fusion result. Both qualitative and quantitative experimental results affirm the superiority of the proposed method over 11 state-of-the-art image fusion methods. Our method excels in preserving more texture and achieving higher contrast, which proves advantageous for monitoring tasks.
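A minimal two-scale version of decomposition-based fusion, split each image into a base (low-frequency) layer and a detail layer, average the bases, and keep the stronger detail per pixel, can be sketched as below. This is a simplified stand-in for the paper's hybrid multi-scale decomposition and saliency-based weighting:

```python
import numpy as np

def box_blur(img, k=3):
    """Simple k x k box filter with edge padding."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def fuse_ir_visible(ir, vis):
    """Two-scale fusion: averaged base layers + per-pixel max-abs details."""
    ir_base, vis_base = box_blur(ir), box_blur(vis)
    ir_det, vis_det = ir - ir_base, vis - vis_base     # detail layers
    base = 0.5 * (ir_base + vis_base)
    detail = np.where(np.abs(ir_det) >= np.abs(vis_det), ir_det, vis_det)
    return base + detail

rng = np.random.default_rng(4)
ir = rng.random((6, 6))
vis = rng.random((6, 6))
fused = fuse_ir_visible(ir, vis)
```

A sanity check on the design: fusing an image with itself reproduces it exactly, since base + detail recompose the original.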
"Infrared and visible image fusion based on hybrid multi-scale decomposition and adaptive contrast enhancement" — Yueying Luo, Kangjian He, Dan Xu, Hongzhen Shi, Wenxia Yin. Signal Processing: Image Communication, vol. 130, Article 117228. DOI: 10.1016/j.image.2024.117228.
Pub Date: 2024-09-30 | DOI: 10.1016/j.image.2024.117214
Dajian Zhong, Shivakumara Palaiahnakote, Umapada Pal, Yue Lu
Unlike objective-type evaluation, descriptive answer evaluation is challenging due to unpredictable answers and the free writing style of answers. For these reasons, descriptive answer evaluation has received special attention from many researchers. Automatic answer evaluation avoids human intervention in marking, eliminates marking bias and, most importantly, saves substantial manpower. To develop an efficient and accurate system, several open challenges remain. One such challenge is cleaning the document, which includes removing struck-out words and restoring them. In this paper, we propose a system for struck-out handwritten word detection and restoration for automatic descriptive answer evaluation. The work has two stages. In the first stage, we explore the combination of ResNet50 and a diagonal line (principal and secondary diagonal lines) segmentation module for detecting words, and then classify struck-out words using a classification network. In the second stage, we explore the combination of U-Net as a backbone and Bi-LSTM for predicting the pixels that represent the actual text information of the struck-out words, based on the relationship between sequences of pixels, for restoration. Experimental results on our dataset and standard datasets show that the proposed model is impressive for struck-out word detection and restoration. A comparative study with state-of-the-art methods shows that the proposed approach outperforms existing models in terms of struck-out word detection and restoration.
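To convey the intuition behind the diagonal-line cue for struck-out words, here is a toy heuristic that measures how much ink lies along the principal and secondary diagonals of a word's bounding box. The paper uses a learned ResNet50 plus a diagonal-line segmentation module; this score is only an illustration of why diagonals are informative:

```python
import numpy as np

def diagonal_line_score(word_mask):
    """Fraction of ink along the principal/secondary bounding-box diagonals.

    word_mask: (H, W) binary array, 1 = ink. A diagonal strike-through
    raises this score relative to an ordinary word.
    """
    h, w = word_mask.shape
    ys = np.arange(h)
    xs_main = np.round(ys * (w - 1) / max(h - 1, 1)).astype(int)
    main = word_mask[ys, xs_main].mean()             # principal diagonal
    anti = word_mask[ys, (w - 1) - xs_main].mean()   # secondary diagonal
    return max(main, anti)

# A clean "word": a horizontal bar of ink
clean = np.zeros((8, 16), dtype=int)
clean[3:5, :] = 1
# The same word with a diagonal strike-through added
struck = clean.copy()
for y in range(8):
    struck[y, int(round(y * 15 / 7))] = 1
score_clean = diagonal_line_score(clean)
score_struck = diagonal_line_score(struck)
```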
"Struck-out handwritten word detection and restoration for automatic descriptive answer evaluation" — Dajian Zhong, Shivakumara Palaiahnakote, Umapada Pal, Yue Lu. Signal Processing: Image Communication, vol. 130, Article 117214. DOI: 10.1016/j.image.2024.117214. Open access.
Pub Date: 2024-09-23, DOI: 10.1016/j.image.2024.117212
Paolo Giannitrapani, Elio D. Di Claudio, Giovanni Jacovitti
Objective Image Quality Assessment (IQA) methods often lack linearity in their quality estimates with respect to the scores expressed by human subjects, so IQA metrics typically undergo a calibration process based on examples of subjective quality. However, example-based training presents a generalization challenge, hampering the comparison of results across different applications and operating conditions. In this paper, new Full Reference (FR) techniques are introduced that provide estimates linearly correlated with human scores without calibration. We show that, on natural images, applying estimation theory and psychophysical principles to images degraded by Gaussian blur leads to a so-called canonical IQA method, whose estimates are linearly correlated with both the subjective scores and the viewing distance. We then show that any mainstream IQA method can be converted to the canonical method by remapping its metric on the basis of a single specimen image. The proposed scheme is extended to wide classes of degraded images, e.g., noisy and compressed images. The resulting calibration-free FR IQA methods allow comparability and interoperability across different imaging systems and viewing distances. Finally, a comparison of their statistical performance with state-of-the-art calibration-prone methods shows that the presented model is a valid alternative to the final 5-parameter calibration step of IQA methods; the two parameters of the model have a clear operational meaning and are easily determined in practical applications. The enhanced performance is achieved across multiple viewing-distance databases by independently realigning the blur values associated with each distance.
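The conversion idea in this abstract, remapping a mainstream metric through its response on a single specimen image degraded by known Gaussian blur levels and then applying a two-parameter linear model, can be sketched roughly as follows. The curve inversion, the parameter names `a` and `b`, and the example calibration values are illustrative assumptions, not the paper's actual procedure.

```python
# Hypothetical sketch: map a raw metric score to a blur-equivalent value
# via a monotone (blur_sigma, metric_value) curve measured on a single
# specimen image, then apply a two-parameter linear quality model.
from bisect import bisect_left

def metric_to_equivalent_blur(metric_value, specimen_curve):
    """Invert a monotonically decreasing (sigma, metric) curve by
    piecewise-linear interpolation: raw metric -> equivalent blur sigma."""
    sigmas = [s for s, _ in specimen_curve]
    values = [v for _, v in specimen_curve]
    # Metric decreases as blur grows; reverse so values are ascending.
    rev_vals = values[::-1]
    rev_sigs = sigmas[::-1]
    if metric_value <= rev_vals[0]:
        return rev_sigs[0]
    if metric_value >= rev_vals[-1]:
        return rev_sigs[-1]
    i = bisect_left(rev_vals, metric_value)
    v0, v1 = rev_vals[i - 1], rev_vals[i]
    s0, s1 = rev_sigs[i - 1], rev_sigs[i]
    t = (metric_value - v0) / (v1 - v0)
    return s0 + t * (s1 - s0)

def canonical_quality(metric_value, specimen_curve, a, b):
    """Two-parameter linear model on the blur-equivalent scale."""
    return a - b * metric_to_equivalent_blur(metric_value, specimen_curve)

# Illustrative specimen calibration: metric drops as blur sigma grows.
curve = [(0.0, 1.00), (1.0, 0.90), (2.0, 0.70), (4.0, 0.40)]
q = canonical_quality(0.80, curve, a=100.0, b=10.0)  # blur-equivalent 1.5
```

Because the remapping is anchored to one specimen curve rather than a fitted 5-parameter logistic, the same two parameters can in principle be reused across metrics and datasets.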
{"title":"Full-reference calibration-free image quality assessment","authors":"Paolo Giannitrapani, Elio D. Di Claudio , Giovanni Jacovitti","doi":"10.1016/j.image.2024.117212","DOIUrl":"10.1016/j.image.2024.117212","url":null,"abstract":"<div><div>Objective Image Quality Assessment (IQA) methods often lack of linearity of their quality estimates with respect to scores expressed by human subjects and therefore IQA metrics undergo a calibration process based on subjective quality examples. However, example-based training presents a challenge in terms of generalization hampering result comparison across different applications and operative conditions. In this paper, new Full Reference (FR) techniques, providing estimates linearly correlated with human scores without using calibration are introduced. We show that on natural images, application of estimation theory and psychophysical principles to images degraded by Gaussian blur leads to a so-called canonical IQA method, whose estimates are linearly correlated to both the subjective scores and the viewing distance. Then, we show that any mainstream IQA methods can be reconducted to the canonical method by converting its metric based on a unique specimen image. The proposed scheme is extended to wide classes of degraded images, e.g. noisy and compressed images. The resulting calibration-free FR IQA methods allows for comparability and interoperability across different imaging systems and on different viewing distances. A comparison of their statistical performance with respect to state-of-the-art calibration prone methods is finally provided, showing that the presented model is a valid alternative to the final 5-parameter calibration step of IQA methods, and the two parameters of the model have a clear operational meaning and are simply determined in practical applications. 
The enhanced performance is achieved across multiple viewing distance databases by independently realigning the blur values associated with each distance.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117212"},"PeriodicalIF":3.4,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142328081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-19, DOI: 10.1016/j.image.2024.117213
Sidi He, Chengfang Zhang, Haoyue Li, Ziliang Feng
Multi-focus image fusion merges multiple images captured from different focused regions of a scene into a single fully focused image. Convolutional sparse coding (CSC) methods are commonly employed to extract focused regions accurately, but they often disregard computational cost. To address this, an online convolutional sparse coding (OCSC) technique was introduced, but its performance remains limited by the number of filters it can use. To overcome these limitations, Sample-Dependent Dictionary-based Online Convolutional Sparse Coding (SCSC) was proposed, which enables the use of additional filters while maintaining low time and space complexity when processing high-dimensional or large data. Leveraging the computational efficiency and effective global feature extraction of SCSC, we propose a novel multi-focus image fusion method. Each source image is decomposed into two layers: a base layer capturing the predominant features and a detail layer containing the finer details. The fused base and detail layers are then combined to reconstruct the final image. The proposed method significantly mitigates artifacts, preserves fine details at the focus boundary, and demonstrates notable enhancements in both visual quality and objective evaluation of multi-focus image fusion.
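The generic two-layer fusion scheme the abstract describes can be sketched as below. The box filter and the average/max-absolute fusion rules are common stand-ins assumed for illustration; the paper's actual method performs the decomposition and fusion with SCSC.

```python
# Sketch of two-layer multi-focus fusion: split each source image into a
# smooth base layer and a residual detail layer, fuse bases by averaging
# and details by max-absolute selection, then recombine.

def box_blur(img, r=1):
    """Base layer via a (2r+1)x(2r+1) box filter with edge clamping."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, n = 0.0, 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
                    n += 1
            out[y][x] = acc / n
    return out

def fuse_two_layer(img_a, img_b):
    """Fuse two grayscale images given as 2-D lists of floats."""
    base_a, base_b = box_blur(img_a), box_blur(img_b)
    h, w = len(img_a), len(img_a[0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            det_a = img_a[y][x] - base_a[y][x]
            det_b = img_b[y][x] - base_b[y][x]
            base = 0.5 * (base_a[y][x] + base_b[y][x])     # average bases
            det = det_a if abs(det_a) >= abs(det_b) else det_b  # max-abs detail
            fused[y][x] = base + det
    return fused
```

The max-absolute detail rule favors the sharper (in-focus) source at each pixel, which is the standard heuristic behind base/detail fusion schemes of this kind.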
{"title":"Improved multi-focus image fusion using online convolutional sparse coding based on sample-dependent dictionary","authors":"Sidi He , Chengfang Zhang , Haoyue Li , Ziliang Feng","doi":"10.1016/j.image.2024.117213","DOIUrl":"10.1016/j.image.2024.117213","url":null,"abstract":"<div><div>Multi-focus image fusion merges multiple images captured from different focused regions of a scene to create a fully-focused image. Convolutional sparse coding (CSC) methods are commonly employed for accurate extraction of focused regions, but they often disregard computational costs. To overcome this, an online convolutional sparse coding (OCSC) technique was introduced, but its performance is still limited by the number of filters used, affecting overall performance negatively. To address these limitations, a novel approach called Sample-Dependent Dictionary-based Online Convolutional Sparse Coding (SCSC) was proposed. SCSC enables the utilization of additional filters while maintaining low time and space complexity for processing high-dimensional or large data. Leveraging the computational efficiency and effective global feature extraction of SCSC, we propose a novel method for multi-focus image fusion. Our method involves a two-layer decomposition of each source image, yielding a base layer capturing the predominant features and a detail layer containing finer details. The amalgamation of the fused base and detail layers culminates in the reconstruction of the final image. 
The proposed method significantly mitigates artifacts, preserves fine details at the focus boundary, and demonstrates notable enhancements in both visual quality and objective evaluation of multi-focus image fusion.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117213"},"PeriodicalIF":3.4,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142311314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}