Comparison and analysis of computer vision models based on images of catamount and canid

Feng Jiang, Yueyufei Ma, Langyue Wang
International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022)
Published 2023-04-28. DOI: 10.1117/12.2671468
Nowadays, applications based on image recognition in daily life, scientific research, and industry, such as target recognition, driverless vehicles, and medical image diagnosis, rely mainly on a variety of large models with excellent performance, from the original Convolutional Neural Network (CNN) to the many variants of classical models proposed since. In this paper, we take the task of classifying images of catamounts and canids as an example and compare the efficiency and accuracy of a CNN, the Vision Transformer (ViT), and the Swin Transformer side by side. We train each model for 25 epochs and record its accuracy and time consumption separately. The experiments show that, comparing epoch counts against wall-clock time, the CNN takes the least total time, followed by the Swin Transformer. ViT requires the fewest epochs to reach convergence, while the Swin Transformer requires the most. In terms of training accuracy, ViT is the highest, followed by the Swin Transformer, with the CNN the lowest; validation accuracy follows the same ordering. In short, ViT achieves the highest accuracy but takes the longest time; conversely, the CNN takes the shortest time but has the lowest accuracy. The Swin Transformer, which can be seen as combining ideas from CNNs and ViT, is the most complex of the three but delivers a good balance of accuracy and speed. ViT is a promising model that deserves further research and exploration to contribute to the computer vision field.
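The measurement protocol described above (train each model for a fixed 25 epochs, record per-epoch accuracy and elapsed time, then compare total time and epochs-to-convergence) can be sketched as a small harness. This is a hypothetical illustration, not the paper's actual code: the three accuracy curves below are synthetic stand-ins whose shapes merely mirror the reported qualitative ordering (ViT most accurate and fastest to converge, CNN least accurate), and the convergence threshold `CONV_DELTA` is an assumed value.

```python
# Hypothetical sketch of the paper's measurement protocol: run a fixed
# number of epochs per model, record accuracy after each epoch, and report
# total wall-clock time plus the first epoch at which accuracy stops
# improving meaningfully. The accuracy curves are synthetic stand-ins.
import time

EPOCHS = 25
CONV_DELTA = 0.002  # assumed: "converged" once per-epoch gain drops below this


def run_trial(train_epoch):
    """Run EPOCHS epochs; return (total_seconds, accuracy_history, conv_epoch)."""
    history = []
    conv_epoch = EPOCHS
    start = time.perf_counter()
    for epoch in range(1, EPOCHS + 1):
        acc = train_epoch(epoch)  # one epoch of training -> accuracy
        if history and conv_epoch == EPOCHS and acc - history[-1] < CONV_DELTA:
            conv_epoch = epoch  # first epoch with negligible improvement
        history.append(acc)
    total = time.perf_counter() - start
    return total, history, conv_epoch


# Synthetic accuracy curves (illustrative shapes only, chosen to mirror the
# abstract's findings; a real run would call each model's training step here).
curves = {
    "CNN":  lambda e: min(0.85, 0.50 + 0.03 * e),
    "ViT":  lambda e: min(0.95, 0.55 + 0.08 * e),
    "Swin": lambda e: min(0.92, 0.45 + 0.04 * e),
}

results = {name: run_trial(f) for name, f in curves.items()}
for name, (total, history, conv) in results.items():
    print(f"{name}: final acc={history[-1]:.2f}, converged by epoch {conv}")
```

In a real experiment, `train_epoch` would wrap one pass over the training set for the given architecture, so the same harness yields comparable time and convergence numbers for all three models.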