Liang Gong, Hang Dong, Xinyu Zhang, Xin Cheng, Fan Ye, Liangchao Guo, Zhenghui Ge
{"title":"尖峰 ViT:用于钢材表面缺陷分类的尖峰神经网络与变压器注意事项","authors":"Liang Gong, Hang Dong, Xinyu Zhang, Xin Cheng, Fan Ye, Liangchao Guo, Zhenghui Ge","doi":"10.1117/1.jei.33.3.033001","DOIUrl":null,"url":null,"abstract":"Throughout the steel production process, a variety of surface defects inevitably occur. These defects can impair the quality of steel products and reduce manufacturing efficiency. Therefore, it is crucial to study and categorize the multiple defects on the surface of steel strips. Vision transformer (ViT) is a unique neural network model based on a self-attention mechanism that is widely used in many different disciplines. Conventional ViT ignores the specifics of brain signaling and instead uses activation functions to simulate genuine neurons. One of the fundamental building blocks of a spiking neural network is leaky integration and fire (LIF), which has biodynamic characteristics akin to those of a genuine neuron. LIF neurons work in an event-driven manner such that higher performance can be achieved with less power. The goal of this work is to integrate ViT and LIF neurons to build and train an end-to-end hybrid network architecture, spiking vision transformer (S-ViT), for the classification of steel surface defects. The framework relies on the ViT architecture by replacing the activation functions used in ViT with LIF neurons, constructing a global spike feature fusion module spiking transformer encoder as well as a spiking-MLP classification head for implementing the classification functionality and using it as a basic building block of S-ViT. Based on the experimental results, our method has demonstrated outstanding classification performance across all metrics. The overall test accuracies of S-ViT are 99.41%, 99.65%, 99.54%, and 99.77% on NEU-CLSs, and 95.70%, 95.93%, 96.94%, and 97.19% on XSDD. S-ViT achieves superior classification performance compared to convolutional neural networks and recent findings. Its performance is also improved relative to the original ViT model. Furthermore, the robustness test results of S-ViT show that S-ViT still maintains reliable accuracy when recognizing images that contain Gaussian noise.","PeriodicalId":54843,"journal":{"name":"Journal of Electronic Imaging","volume":"29 1","pages":""},"PeriodicalIF":1.0000,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spiking ViT: spiking neural networks with transformer—attention for steel surface defect classification\",\"authors\":\"Liang Gong, Hang Dong, Xinyu Zhang, Xin Cheng, Fan Ye, Liangchao Guo, Zhenghui Ge\",\"doi\":\"10.1117/1.jei.33.3.033001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Throughout the steel production process, a variety of surface defects inevitably occur. These defects can impair the quality of steel products and reduce manufacturing efficiency. Therefore, it is crucial to study and categorize the multiple defects on the surface of steel strips. Vision transformer (ViT) is a unique neural network model based on a self-attention mechanism that is widely used in many different disciplines. Conventional ViT ignores the specifics of brain signaling and instead uses activation functions to simulate genuine neurons. One of the fundamental building blocks of a spiking neural network is leaky integration and fire (LIF), which has biodynamic characteristics akin to those of a genuine neuron. LIF neurons work in an event-driven manner such that higher performance can be achieved with less power. The goal of this work is to integrate ViT and LIF neurons to build and train an end-to-end hybrid network architecture, spiking vision transformer (S-ViT), for the classification of steel surface defects. The framework relies on the ViT architecture by replacing the activation functions used in ViT with LIF neurons, constructing a global spike feature fusion module spiking transformer encoder as well as a spiking-MLP classification head for implementing the classification functionality and using it as a basic building block of S-ViT. Based on the experimental results, our method has demonstrated outstanding classification performance across all metrics. The overall test accuracies of S-ViT are 99.41%, 99.65%, 99.54%, and 99.77% on NEU-CLSs, and 95.70%, 95.93%, 96.94%, and 97.19% on XSDD. S-ViT achieves superior classification performance compared to convolutional neural networks and recent findings. Its performance is also improved relative to the original ViT model. Furthermore, the robustness test results of S-ViT show that S-ViT still maintains reliable accuracy when recognizing images that contain Gaussian noise.\",\"PeriodicalId\":54843,\"journal\":{\"name\":\"Journal of Electronic Imaging\",\"volume\":\"29 1\",\"pages\":\"\"},\"PeriodicalIF\":1.0000,\"publicationDate\":\"2024-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Electronic Imaging\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1117/1.jei.33.3.033001\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Electronic Imaging","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1117/1.jei.33.3.033001","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
摘要
在整个钢铁生产过程中,不可避免地会出现各种表面缺陷。这些缺陷会损害钢铁产品的质量,降低生产效率。因此,对钢带表面的多种缺陷进行研究和分类至关重要。视觉变换器(ViT)是一种基于自我注意机制的独特神经网络模型,被广泛应用于许多不同学科。传统的 ViT 忽略了大脑信号传递的具体细节,而是使用激活函数来模拟真正的神经元。尖峰神经网络的基本构件之一是泄漏整合与发射(LIF),它具有与真正神经元类似的生物动力学特征。LIF 神经元以事件驱动的方式工作,因此能以更低的功耗实现更高的性能。这项工作的目标是整合 ViT 神经元和 LIF 神经元,构建并训练一种端到端混合网络架构--尖峰视觉转换器(S-ViT),用于对钢铁表面缺陷进行分类。该框架以 ViT 架构为基础,用 LIF 神经元替换了 ViT 中使用的激活函数,构建了全局尖峰特征融合模块尖峰变换器编码器和尖峰-MLP 分类头,以实现分类功能,并将其作为 S-ViT 的基本构建模块。根据实验结果,我们的方法在所有指标上都表现出了出色的分类性能。在 NEU-CLS 上,S-ViT 的总体测试准确率分别为 99.41%、99.65%、99.54% 和 99.77%;在 XSDD 上,S-ViT 的总体测试准确率分别为 95.70%、95.93%、96.94% 和 97.19%。与卷积神经网络和最新研究成果相比,S-ViT 的分类性能更为出色。与原始 ViT 模型相比,其性能也有所提高。此外,S-ViT 的鲁棒性测试结果表明,在识别含有高斯噪声的图像时,S-ViT 仍能保持可靠的准确性。
Spiking ViT: spiking neural networks with transformer—attention for steel surface defect classification
Throughout the steel production process, a variety of surface defects inevitably occur. These defects can impair the quality of steel products and reduce manufacturing efficiency. Therefore, it is crucial to study and categorize the multiple defects on the surface of steel strips. Vision transformer (ViT) is a unique neural network model based on a self-attention mechanism that is widely used in many different disciplines. Conventional ViT ignores the specifics of brain signaling and instead uses activation functions to simulate genuine neurons. One of the fundamental building blocks of a spiking neural network is leaky integration and fire (LIF), which has biodynamic characteristics akin to those of a genuine neuron. LIF neurons work in an event-driven manner such that higher performance can be achieved with less power. The goal of this work is to integrate ViT and LIF neurons to build and train an end-to-end hybrid network architecture, spiking vision transformer (S-ViT), for the classification of steel surface defects. The framework relies on the ViT architecture by replacing the activation functions used in ViT with LIF neurons, constructing a global spike feature fusion module spiking transformer encoder as well as a spiking-MLP classification head for implementing the classification functionality and using it as a basic building block of S-ViT. Based on the experimental results, our method has demonstrated outstanding classification performance across all metrics. The overall test accuracies of S-ViT are 99.41%, 99.65%, 99.54%, and 99.77% on NEU-CLSs, and 95.70%, 95.93%, 96.94%, and 97.19% on XSDD. S-ViT achieves superior classification performance compared to convolutional neural networks and recent findings. Its performance is also improved relative to the original ViT model. Furthermore, the robustness test results of S-ViT show that S-ViT still maintains reliable accuracy when recognizing images that contain Gaussian noise.
期刊介绍:
The Journal of Electronic Imaging publishes peer-reviewed papers in all technology areas that make up the field of electronic imaging and are normally considered in the design, engineering, and applications of electronic imaging systems.