Pub Date : 2024-04-03DOI: 10.1007/s10044-024-01252-5
Abstract
Deep learning algorithms have gained widespread usage in defect detection systems. However, existing methods are not yet satisfactory for large-scale application to surface defect detection of strip steel. In this paper, we propose a precise and efficient detection model, named CABF-YOLO, based on YOLOX, for strip steel surface defects. Firstly, we introduce a Triplet Convolutional Coordinate Attention (TCCA) module in the backbone of YOLOX. By factorizing the pooling operation, the TCCA module can accurately capture cross-channel features to identify the location of defects. Secondly, we design a novel Bidirectional Fusion (BF) strategy in the neck of YOLOX. The BF strategy enhances the fusion of low-level and high-level semantic information to obtain fine-grained information. Lastly, the original bounding box loss function is replaced by the EIoU loss, whose penalty term accounts for the overlap area, the central point distance, and the side lengths of the regressed boxes, accelerating convergence and improving localization accuracy. On the benchmark NEU-DET and GC10-DET datasets, experimental results show that CABF-YOLO achieves superior performance compared with other models and satisfies the real-time detection requirements of industrial production.
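The EIoU penalty described above is commonly defined as the IoU term plus separate penalties for center distance and for the width and height gaps, each normalized by the enclosing box. A minimal sketch of that standard formulation (an illustration of the loss family the paper adopts, not the authors' code):

```python
def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # overlap area and union
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)
    # smallest enclosing box
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    # central point penalty, normalized by the enclosing diagonal
    d2 = ((px1 + px2 - tx1 - tx2) / 2) ** 2 + ((py1 + py2 - ty1 - ty2) / 2) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # side length penalties, normalized by the enclosing width/height
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2
    return 1 - iou + d2 / c2 + dw2 / (cw ** 2 + eps) + dh2 / (ch ** 2 + eps)
```

Because each penalty is minimized independently rather than through an aspect-ratio proxy, gradients point directly at the width and height errors, which is the source of the faster convergence the abstract mentions.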
CABF-YOLO: a precise and efficient deep learning method for defect detection on strip steel surface. Pattern Analysis and Applications (2024). DOI: 10.1007/s10044-024-01252-5
Pub Date : 2024-04-02DOI: 10.1007/s10044-024-01251-6
Yanbo Feng, Adel Hafiane, Hélène Laurent
The segmentation of pathological images is an indispensable step in cancer diagnosis and grading, providing doctors with the location and quantitative analysis of pathologically altered tissue. However, a pathological whole slide image (WSI) generally has gigapixel size and huge region-level objects to be segmented. Extracting patches from a WSI addresses the limitation of computer memory, but the integrity of the target is thereby affected. Moreover, supervised learning methods require manually annotated labels for training, which is laborious and time-consuming. Thus, we study a novel weakly supervised learning (WSL)-based end-to-end framework for semantic segmentation of cancerous areas in WSIs. The proposed framework builds on block-level segmentation with a convolutional neural network (CNN), where the CNN integrates a global average pooling layer and a single fully connected layer to form the WSL-CNN. Class activation maps and dense conditional random fields (DenseCRF) are adapted to realize pixel-level segmentation of the cancerous area in each patch, which is incorporated into the classification process of the WSL-CNN. The hierarchical double use of DenseCRF effectively improves the precision of the semantic segmentation. A region-based annotation method and a flexible method for constructing the training dataset are proposed to reduce the annotation workload. Experiments show that block-level segmentation with CNNs outperforms pixel-level segmentation with fully convolutional networks; ResNet50 performs best, achieving an F1 score of 0.87426, a Jaccard score of 0.78079, a Recall of 0.94251, and a Precision of 0.82182. The proposed framework can effectively refine block-level predictions into semantic segmentation without pixel-level labels. The precision of all tested CNNs improves in the experiments, with WSL-ResNet50 achieving an F1 score of 0.90630, a Jaccard score of 0.83230, a Recall of 0.92051, and a Precision of 0.89789. We propose a complete end-to-end framework, including the specific neural network structure, the construction of the training dataset, the prediction method, and the post-processing. CNN-like architectures can be widely transplanted into this framework to realize semantic segmentation, alleviating the problem of insufficient labels for large-scale medical images.
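The class-activation-map step relied on above has a simple closed form once the network ends in global average pooling followed by a single fully connected layer: the per-class heatmap is the conv feature stack weighted by that class's FC weights. A minimal numpy sketch of this standard CAM computation (illustrative only; array shapes are assumptions, not the authors' implementation):

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (C, H, W) output of the last conv layer.
    fc_weights: (num_classes, C) weights of the single FC layer after GAP.
    Returns a (H, W) heatmap normalized to [0, 1]."""
    # weighted sum of feature channels for the chosen class
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam -= cam.min()
    cam /= cam.max() + 1e-8
    return cam
```

Because the class logit equals the spatial mean of this map (up to the FC bias), high-activation pixels directly explain the block-level classification, which is what DenseCRF then refines to pixel level.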
A weakly supervised end-to-end framework for semantic segmentation of cancerous area in whole slide image. Yanbo Feng, Adel Hafiane, Hélène Laurent. Pattern Analysis and Applications (2024). DOI: 10.1007/s10044-024-01251-6
Data augmentation methods are crucial for improving the accuracy of densely occluded object recognition when the quantity and diversity of training images are insufficient. However, current methods that use regional dropping and mixing strategies suffer from missing foreground objects and redundant background features, which can degrade the recognition of densely occluded objects in classification or detection tasks. Herein, a saliency-information- and mosaic-based data augmentation method for densely occluded object recognition is proposed, which utilizes saliency information as prior knowledge to supervise the mosaicking of training images containing densely occluded objects. The method further uses fogging processing and class label mixing to construct new augmented images, improving the accuracy of image classification and object recognition by augmenting the quantity and diversity of the training images. Extensive experiments on different classification datasets with various CNN architectures prove the effectiveness of our method.
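The mosaic-with-label-mixing idea can be sketched without the saliency supervision: four images are resized into the four quadrants around a random split point, and the mixed label weights each class by the area its image occupies. This is a simplified illustration (the random split, nearest-neighbour resize, and area-proportional label are my assumptions; the paper additionally uses saliency maps to keep foregrounds intact and applies fogging):

```python
import numpy as np

def mosaic_augment(images, labels, num_classes, size=64, rng=None):
    """images: four (H, W, 3) uint8 arrays; labels: four class indices.
    Returns a (size, size, 3) mosaic and an area-weighted soft label."""
    if rng is None:
        rng = np.random.default_rng()
    # random split point, kept away from the borders
    cx = int(rng.integers(size // 4, 3 * size // 4))
    cy = int(rng.integers(size // 4, 3 * size // 4))
    out = np.zeros((size, size, 3), dtype=np.uint8)
    quads = [(0, cy, 0, cx), (0, cy, cx, size), (cy, size, 0, cx), (cy, size, cx, size)]
    mixed = np.zeros(num_classes)
    for img, lab, (y0, y1, x0, x1) in zip(images, labels, quads):
        h, w = y1 - y0, x1 - x0
        # naive nearest-neighbour resize of img into the quadrant
        ys = np.arange(h) * img.shape[0] // h
        xs = np.arange(w) * img.shape[1] // w
        out[y0:y1, x0:x1] = img[ys][:, xs]
        mixed[lab] += h * w / size ** 2
    return out, mixed
```

The soft label sums to one by construction, so the augmented sample can be trained with an ordinary cross-entropy against the mixed target.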
Saliency information and mosaic based data augmentation method for densely occluded object recognition. Ying Tong, Xiangfeng Luo, Liyan Ma, Shaorong Xie, Wenbin Yang, Yinsai Guo. Pattern Analysis and Applications (2024). Pub Date: 2024-03-29. DOI: 10.1007/s10044-024-01258-z
Pub Date : 2024-03-19DOI: 10.1007/s10044-024-01259-y
Palanichamy Naveen, Mahmoud Hassaballah
Scene text detection poses a considerable challenge due to the diverse nature of text appearance, backgrounds, and orientations. Enhancing robustness, accuracy, and efficiency in this context is vital for several applications, such as optical character recognition, image understanding, and autonomous vehicles. This paper explores the integration of a generative adversarial network (GAN) and a variational autoencoder (VAE) network to create a robust and potent text detection network. The proposed architecture comprises three interconnected modules: the VAE module, the GAN module, and the text detection module. In this framework, the VAE module plays a pivotal role in generating diverse and variable text regions. Subsequently, the GAN module refines and enhances these regions, ensuring heightened realism and accuracy. The text detection module then identifies text regions in the input image by assigning a confidence score to each region. The entire network is trained by minimizing a joint loss function that encompasses the VAE loss, the GAN loss, and the text detection loss: the VAE loss ensures diversity in the generated text regions, the GAN loss guarantees realism and accuracy, and the text detection loss ensures high-precision identification of text regions. The proposed method employs an encoder-decoder structure within the VAE module and a generator-discriminator structure in the GAN module. Rigorous testing on diverse datasets, including Total-Text, CTW1500, ICDAR 2015, ICDAR 2017, ReCTS, TD500, COCO-Text, SynthText, Street View Text, and KAIST Scene Text, demonstrates the superior performance of the proposed method compared to existing approaches.
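The joint objective described above can be sketched as a weighted sum of the three terms, with the VAE term itself being the usual reconstruction-plus-KL loss. The weights and the mean-squared reconstruction are my assumptions for illustration; the paper's exact terms and balancing are not specified in the abstract:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Standard VAE objective: reconstruction error plus KL divergence
    of the diagonal-Gaussian posterior N(mu, exp(log_var)) from N(0, I)."""
    recon = np.mean((x - x_recon) ** 2)
    kl = -0.5 * np.mean(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

def joint_loss(l_vae, l_gan, l_det, w=(1.0, 1.0, 1.0)):
    # weighted combination; the weights are hypothetical, not from the paper
    return w[0] * l_vae + w[1] * l_gan + w[2] * l_det
```

Training then alternates as in any GAN setup: the discriminator sees real and generated text regions, while the generator/encoder side minimizes this joint loss.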
Scene text detection using structured information and an end-to-end trainable generative adversarial networks. Palanichamy Naveen, Mahmoud Hassaballah. Pattern Analysis and Applications (2024). DOI: 10.1007/s10044-024-01259-y
Pub Date : 2024-03-14DOI: 10.1007/s10044-024-01256-1
Yuxiang Wu, Xiaoyan Wang, Tianpan Chen, Yan Dou
It is important to generate a video summary that is both diverse and representative for massive video collections. In this paper, a convolutional neural network based on a dual-stream attention mechanism (DA-ResNet) is designed to obtain candidate summary sequences for classroom scenes. DA-ResNet constructs a dual-stream input of an image frame sequence and an optical flow frame sequence to enhance expressive ability, and embeds the attention mechanism into ResNet. The final video summary is then obtained by removing redundant frames with an improved hash clustering algorithm: preprocessing is performed first to reduce computational complexity, and hash clustering then retains the frame with the highest entropy value in each class while removing other similar frames. To verify its effectiveness in classroom scenes, we also created ClassVideo, a real dataset consisting of 45 videos from the normal teaching environment of our school. Experimental results show the competitiveness of the proposed method: DA-ResNet outperforms existing methods by about 8% in terms of F-measure. Besides, the visual results also demonstrate its ability to produce classroom video summaries that are very close to human preferences.
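The keep-the-highest-entropy-frame-per-cluster step has a direct sketch: score each frame by the Shannon entropy of its intensity histogram, then keep one frame per cluster. The clustering itself is assumed given here (the paper's hash clustering is not reproduced); this only illustrates the entropy-based selection:

```python
import numpy as np

def frame_entropy(gray):
    """Shannon entropy (bits) of an 8-bit grayscale frame's histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_keyframes(frames, cluster_ids):
    """Keep the highest-entropy frame in each cluster, drop the rest."""
    best = {}
    for idx, (frame, cid) in enumerate(zip(frames, cluster_ids)):
        e = frame_entropy(frame)
        if cid not in best or e > best[cid][0]:
            best[cid] = (e, idx)
    return sorted(idx for _, idx in best.values())
```

A flat, information-poor frame scores zero entropy, so within a cluster of near-duplicates the most detailed frame survives into the summary.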
DA-ResNet: dual-stream ResNet with attention mechanism for classroom video summary. Yuxiang Wu, Xiaoyan Wang, Tianpan Chen, Yan Dou. Pattern Analysis and Applications (2024). DOI: 10.1007/s10044-024-01256-1
The article presents a novel methodology comprising an end-to-end workflow for processing Venus' visible images. The raw visible image is denoised using a tri-state median filter with background dark subtraction and then enhanced using Contrast Limited Adaptive Histogram Equalization. A multi-modal image registration technique is developed using Segmented Affine Scale Invariant Feature Transform with Motion Smoothness Constraint outlier removal to co-register Venus' visible and radar images. A novel image fusion algorithm using a guided filter is developed to merge the multi-modal visible-radar image pair into a fused image. Image quality is assessed at each processing step, and the results are quantified and visualized. In addition, a fuzzy color-coded segmentation map is generated to retrieve crucial information about Venus' surface feature characteristics. The fused image is found to clearly demarcate planetary morphological features, validated against the publicly available Venus radar nomenclature map.
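The tri-state median filter in the denoising step switches, per pixel, between the original value, a center-weighted median (CWM), and the standard median (SM), depending on how far the pixel deviates from each. A minimal 3x3 sketch of this classic scheme (the threshold, window size, and center weight of 3 are illustrative assumptions, not the article's parameters):

```python
import numpy as np

def tri_state_median(img, T=20):
    """3x3 tri-state median filter on a 2-D grayscale image.
    T is the decision threshold between 'keep', 'CWM', and 'SM'."""
    H, W = img.shape
    out = img.astype(int).copy()
    pad = np.pad(img.astype(int), 1, mode="edge")
    for y in range(H):
        for x in range(W):
            win = pad[y:y + 3, x:x + 3].ravel()
            center = win[4]
            sm = int(np.median(win))
            # center-weighted median: count the center pixel three times
            cwm = int(np.median(np.concatenate([win, [center, center]])))
            d1, d2 = abs(center - sm), abs(center - cwm)
            if T >= d1:
                out[y, x] = center   # deviation small: keep the original pixel
            elif d2 <= T:
                out[y, x] = cwm      # moderate deviation: gentle correction
            else:
                out[y, x] = sm       # large deviation (impulse): full median
    return out.astype(img.dtype)
```

The three-way switch is what lets the filter remove impulse noise while leaving fine detail untouched, which matters before the contrast enhancement and fusion stages.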