{"title":"Experiments And Discussions On Vision Transformer (ViT) Parameters For Object Tracking","authors":"Daiki Fukushima, Tomokazu Ishikawa","doi":"10.1109/NicoInt55861.2022.00020","DOIUrl":null,"url":null,"abstract":"Recently, machine learning has been used to improve the accuracy of computer vision, and the latest network model, Transformer, has been widely used in the fields of natural language translation and object recognition. A feature of ViT used in the field of object recognition is that its accuracy is improved by accumulating layers of Transformers. However, the latest models of the previous study of object tracking show that the accuracy decreases as the layers of the Transformer are accumulated. Therefore, in this study, we thought that the accuracy could be improved by changing the experimental conditions while the layers of transformers are accumulated. In addition, by searching for hyperparameters in the loss function, we expect to further improve the accuracy. The experimental results show that the accuracy can be improved by 5% by adjusting the parameters of regression loss and loss on bounding box size. Also, the model used in this study has a problem that the accuracy decreased by up to 7% when the number of Transformer layers is increased. Although the accuracy improved by 2% compared to the model without adjusting the parameters when the parameters of the loss function are adjusted with the number of Transformer layers increased.","PeriodicalId":328114,"journal":{"name":"2022 Nicograph International (NicoInt)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Nicograph International (NicoInt)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NicoInt55861.2022.00020","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Recently, machine learning has been used to improve the accuracy of computer vision, and the Transformer, the latest network model, has been widely adopted in natural language translation and object recognition. A characteristic of the Vision Transformer (ViT) used in object recognition is that its accuracy improves as Transformer layers are stacked. However, the latest models in previous studies of object tracking show that accuracy decreases as Transformer layers are stacked. In this study, we therefore investigated whether accuracy can be improved by changing the experimental conditions while stacking Transformer layers. In addition, we searched for hyperparameters of the loss function, expecting a further improvement in accuracy. The experimental results show that accuracy improves by 5% when the parameters of the regression loss and the bounding-box-size loss are adjusted. The model used in this study also has the problem that accuracy decreases by up to 7% when the number of Transformer layers is increased. Even so, when the loss-function parameters are adjusted with an increased number of Transformer layers, accuracy improves by 2% compared with the model whose parameters are not adjusted.
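The abstract refers to adjusting the weights of a regression loss and a loss on bounding-box size. As a rough illustration of what such a weighted combination can look like in a Transformer-based tracker, the sketch below combines a GIoU regression term and an L1 box term with tunable coefficients; the function name, the choice of loss terms, and the weight values are assumptions for illustration, not the paper's implementation.

```python
# A minimal sketch (not the authors' code) of a tracking loss with tunable
# weights for a regression (GIoU) term and a bounding-box (L1) term.
# Assumes PyTorch and torchvision; the default weights are hypothetical
# placeholders, not values reported in the paper.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def tracking_loss(pred_boxes, gt_boxes, w_giou=2.0, w_l1=5.0):
    """pred_boxes, gt_boxes: (N, 4) tensors in (x1, y1, x2, y2) format."""
    # Regression term: generalized IoU between predicted and ground-truth boxes.
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    # Box term: L1 distance between box coordinates (sensitive to box size).
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    # The hyperparameter search described in the abstract would sweep
    # w_giou and w_l1 and compare tracking accuracy for each setting.
    return w_giou * giou + w_l1 * l1
```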