Peiqiang Liu, Qifeng Liang, Zhiyong An, Jingyi Fu, Yanyan Mao
Most Siamese-based trackers use classification and regression to determine the target bounding box, which can be formulated as a linear matching process between the template and the search region. However, this only takes into account feature similarity while ignoring semantic object information, so in some cases the regression box with the highest classification score is not accurate. To address this lack of semantic information, an object tracking approach based on an ensemble semantic-aware network and redetection (ESART) is proposed. Furthermore, a DarkNet53 network with transfer learning is used as our semantic-aware model, adapting the detection task to extract semantic information. In addition, a semantic tag redetection method is proposed to re-evaluate the bounding box and overcome inaccurate scaling. Extensive experiments on OTB2015, UAV123, UAV20L, and GOT-10k show that our tracker is superior to other state-of-the-art trackers. Notably, our semantic-aware ensemble method can be embedded into any tracker that performs classification and regression.
{"title":"Robust object tracking via ensembling semantic-aware network and redetection","authors":"Peiqiang Liu, Qifeng Liang, Zhiyong An, Jingyi Fu, Yanyan Mao","doi":"10.1049/cvi2.12219","DOIUrl":"10.1049/cvi2.12219","url":null,"abstract":"<p>Most Siamese-based trackers use classification and regression to determine the target bounding box, which can be formulated as a linear matching process of the template and search region. However, this only takes into account the similarity of features while ignoring the semantic object information, resulting in some cases in which the regression box with the highest classification score is not accurate. To address the lack of semantic information, an object tracking approach based on an ensemble semantic-aware network and redetection (ESART) is proposed. Furthermore, a DarkNet53 network with transfer learning is used as our semantic-aware model to adapt the detection task for extracting semantic information. In addition, a semantic tag redetection method to re-evaluate the bounding box and overcome inaccurate scaling issues is proposed. Extensive experiments based on OTB2015, UAV123, UAV20L, and GOT-10k show that our tracker is superior to other state-of-the-art trackers. It is noteworthy that our semantic-aware ensemble method can be embedded into any tracker for classification and regression task.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"46-59"},"PeriodicalIF":1.7,"publicationDate":"2023-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12219","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42081075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent studies reveal the crucial role of local features in learning robust and discriminative representations for person re-identification (Re-ID). Existing approaches typically rely on external tasks, for example semantic segmentation or pose estimation, to locate identifiable parts of given images. However, they heuristically utilise the predictions from off-the-shelf models, which may be sub-optimal in terms of both local partition and computational efficiency. They also ignore the mutual information with other inputs, which weakens the representation capabilities of local features. In this study, the authors put forward a novel Attribute-guided Transformer (AiT), which explicitly exploits pedestrian attributes as semantic priors for discriminative representation learning. Specifically, the authors first introduce an attribute learning process, which generates a set of attention maps highlighting the informative parts of pedestrian images. Then, the authors design a Feature Diffusion Module (FDM) to iteratively inject attribute information into global feature maps, aiming at suppressing unnecessary noise and inferring attribute-aware representations. Finally, the authors propose a Feature Aggregation Module (FAM) that exploits mutual information to aggregate attribute characteristics from different images, enhancing the representation capabilities of the feature embedding. Extensive experiments demonstrate the superiority of AiT in learning robust and discriminative representations. As a result, the authors achieve performance competitive with state-of-the-art methods on several challenging benchmarks without any bells and whistles.
{"title":"Attribute-guided transformer for robust person re-identification","authors":"Zhe Wang, Jun Wang, Junliang Xing","doi":"10.1049/cvi2.12215","DOIUrl":"10.1049/cvi2.12215","url":null,"abstract":"<p>Recent studies reveal the crucial role of local features in learning robust and discriminative representations for person re-identification (Re-ID). Existing approaches typically rely on external tasks, for example, semantic segmentation, or pose estimation, to locate identifiable parts of given images. However, they heuristically utilise the predictions from off-the-shelf models, which may be sub-optimal in terms of both local partition and computational efficiency. They also ignore the mutual information with other inputs, which weakens the representation capabilities of local features. In this study, the authors put forward a novel Attribute-guided Transformer (AiT), which explicitly exploits pedestrian attributes as semantic priors for discriminative representation learning. Specifically, the authors first introduce an attribute learning process, which generates a set of attention maps highlighting the informative parts of pedestrian images. Then, the authors design a Feature Diffusion Module (FDM) to iteratively inject attribute information into global feature maps, aiming at suppressing unnecessary noise and inferring attribute-aware representations. Last, the authors propose a Feature Aggregation Module (FAM) to exploit mutual information for aggregating attribute characteristics from different images, enhancing the representation capabilities of feature embedding. Extensive experiments demonstrate the superiority of our AiT in learning robust and discriminative representations. As a result, the authors achieve competitive performance with state-of-the-art methods on several challenging benchmarks without any bells and whistles.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 8","pages":"977-992"},"PeriodicalIF":1.7,"publicationDate":"2023-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12215","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49366041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The use of deep neural networks has revolutionised object tracking, and Siamese trackers have emerged as a prominent technique for this purpose. Existing Siamese trackers use a fixed template or a template updating technique, but these are prone to overfitting, lack the capacity to exploit global temporal sequences, and cannot utilise multi-layer features. As a result, it is challenging to deal with dramatic appearance changes in complicated scenarios. Siamese trackers also struggle to learn background information, which impairs their discriminative ability. Hence, two transformer-based modules, the Spatio-Temporal Fusion (ST) module and the Discriminative Enhancement (DE) module, are proposed to improve the performance of Siamese trackers. The ST module leverages cross-attention to accumulate global temporal cues and generates an attention matrix with ST similarity to enhance the template's adaptability to changes in target appearance. The DE module associates semantically similar points from the template and the search area, thereby generating a learnable discriminative mask to enhance the discriminative ability of Siamese trackers. In addition, a Multi-Layer ST module (ST + ML) is constructed, which can be integrated into Siamese trackers based on multi-layer cross-correlation for further improvement. The authors evaluate the proposed modules on four public datasets and show competitive performance compared to existing Siamese trackers.
{"title":"DASTSiam: Spatio-temporal fusion and discriminative enhancement for Siamese visual tracking","authors":"Yucheng Huang, Eksan Firkat, Jinlai Zhang, Lijuan Zhu, Bin Zhu, Jihong Zhu, Askar Hamdulla","doi":"10.1049/cvi2.12213","DOIUrl":"10.1049/cvi2.12213","url":null,"abstract":"<p>The use of deep neural networks has revolutionised object tracking tasks, and Siamese trackers have emerged as a prominent technique for this purpose. Existing Siamese trackers use a fixed template or template updating technique, but it is prone to overfitting, lacks the capacity to exploit global temporal sequences, and cannot utilise multi-layer features. As a result, it is challenging to deal with dramatic appearance changes in complicated scenarios. Siamese trackers also struggle to learn background information, which impairs their discriminative ability. Hence, two transformer-based modules, the Spatio-Temporal Fusion (ST) module and the Discriminative Enhancement (DE) module, are proposed to improve the performance of Siamese trackers. The ST module leverages cross-attention to accumulate global temporal cues and generates an attention matrix with ST similarity to enhance the template's adaptability to changes in target appearance. The DE module associates semantically similar points from the template and search area, thereby generating a learnable discriminative mask to enhance the discriminative ability of the Siamese trackers. In addition, a Multi-Layer ST module (ST + ML) was constructed, which can be integrated into Siamese trackers based on multi-layer cross-correlation for further improvement. The authors evaluate the proposed modules on four public datasets and show comparative performance compared to existing Siamese trackers.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 8","pages":"1017-1033"},"PeriodicalIF":1.7,"publicationDate":"2023-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12213","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48829474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The following article for this Special Issue was published in a different issue","authors":"","doi":"10.1049/cvi2.12211","DOIUrl":"https://doi.org/10.1049/cvi2.12211","url":null,"abstract":"","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 1","pages":"614"},"PeriodicalIF":1.7,"publicationDate":"2023-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"57700580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The following article for this Special Issue was published in a different issue","authors":"","doi":"10.1049/cvi2.12211","DOIUrl":"https://doi.org/10.1049/cvi2.12211","url":null,"abstract":"<p>Fan Liu, Feifan Li, Sai Yang. Few-shot classification using Gaussianisation prototypical classifier.</p><p>IET Computer Vision 2023 February; 17(1); 62–75. https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/cvi2.12129</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 5","pages":"614"},"PeriodicalIF":1.7,"publicationDate":"2023-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12211","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50151747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Existing deep-learning-based monocular depth estimation methods have difficulty estimating depth near object edges where the depth between objects changes abruptly, and their accuracy declines when an image contains more noise. Furthermore, these methods consume considerable hardware resources because of their large numbers of network parameters. To solve these problems, this paper proposes a depth estimation method based on weighted fusion and point-wise convolution. The authors design a maximum-average adaptive pooling weighted fusion (MAWF) module that fuses global and local features, and a continuous point-wise convolution module that processes the fused features from the MAWF module. The two modules are applied together three times to perform weighted fusion and point-wise convolution of multi-scale features from the encoder output, which better decodes the depth information of a scene. Experimental results show that our method achieves state-of-the-art performance on the KITTI dataset, with δ1 up to 0.996 and the root mean square error metric down to 8%, and demonstrates strong generalisation and robustness.
{"title":"A monocular image depth estimation method based on weighted fusion and point-wise convolution","authors":"Chen Lei, Liang Zhengyou, Sun Yu","doi":"10.1049/cvi2.12212","DOIUrl":"10.1049/cvi2.12212","url":null,"abstract":"<p>The existing monocular depth estimation methods based on deep learning have difficulty in estimating the depth near the edges of the objects in an image when the depth distance between these objects changes abruptly and decline in accuracy when an image has more noises. Furthermore, these methods consume more hardware resources because they have huge network parameters. To solve these problems, this paper proposes a depth estimation method based on weighted fusion and point-wise convolution. The authors design a maximum-average adaptive pooling weighted fusion module (MAWF) that fuses global features and local features and a continuous point-wise convolution module for processing the fused features derived from the (MAWF) module. The two modules work closely together for three times to perform weighted fusion and point-wise convolution of features of multi-scale from the encoder output, which can better decode the depth information of a scene. Experimental results show that our method achieves state-of-the-art performance on the KITTI dataset with <b><i>δ</i></b><sub>1</sub> up to 0.996 and the root mean square error metric down to 8% and has demonstrated the strong generalisation and robustness.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 8","pages":"1005-1016"},"PeriodicalIF":1.7,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12212","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46812260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saba Sadat Faghih Imani, Kazim Fouladi-Ghaleh, Hossein Aghababa
Most successful person re-ID models conduct supervised training and need a large amount of training data, and they fail to generalise well to unseen, unlabelled testing sets. The authors aim to learn a generalisable person re-identification model that uses one labelled source dataset and one unlabelled target dataset during training and generalises well on the target testing set. To this end, after feature extraction by a ResNext-50 network, the authors optimise the model with three loss functions. (a) One loss function is designed to learn the features of the target domain by tuning the distances between target images, so the trained model is more robust to intra-domain variations in the target domain and generalises well on the target testing set. (b) A triplet loss considers both source and target domains, making the model learn the inter-domain variations between the source and target domains as well as the variations within the target domain. (c) A third loss function performs supervised learning on the labelled source domain. Extensive experiments on Market1501 and DukeMTMC-reID show that the model achieves very competitive performance compared with state-of-the-art models while requiring an acceptable amount of GPU RAM compared to other successful models.
{"title":"Generalizable and efficient cross-domain person re-identification model using deep metric learning","authors":"Saba Sadat Faghih Imani, Kazim Fouladi-Ghaleh, Hossein Aghababa","doi":"10.1049/cvi2.12214","DOIUrl":"10.1049/cvi2.12214","url":null,"abstract":"<p>Most of the successful person re-ID models conduct supervised training and need a large number of training data. These models fail to generalise well on unseen unlabelled testing sets. The authors aim to learn a generalisable person re-identification model. The model uses one labelled source dataset and one unlabelled target dataset during training and generalises well on the target testing set. To this end, after a feature extraction by the ResNext-50 network, the authors optimise the model by three loss functions. (a) One loss function is designed to learn the features of the target domain by tuning the distances between target images. Therefore, the trained model will be more robust to overcome the intra-domain variations in the target domain and generalises well on the target testing set. (b) One triplet loss is used which considers both source and target domains and makes the model learn the inter-domain variations between source and target domain as well as the variations in the target domain. (c) Also, one loss function is for supervised learning on the labelled source domain. Extensive experiments on Market1501 and DukeMTMC re-ID show that the model achieves a very competitive performance compared with state-of-the-art models and also it requires an acceptable amount of GPU RAM compared to other successful models.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 8","pages":"993-1004"},"PeriodicalIF":1.7,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12214","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41952356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Erratum: Integration graph attention network and multi-centre constrained loss for cross-modality person re-identification","authors":"","doi":"10.1049/cvi2.12210","DOIUrl":"https://doi.org/10.1049/cvi2.12210","url":null,"abstract":"","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"48 1","pages":"722"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"57700469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The authors wish to bring to the readers' attention the following errors in the article by He, D., et al.: Integration graph attention network and multi-centre constrained loss for cross-modality person re-identification [1].
In the Funding Information section, the funding number for the National Natural Science Foundation of China is incorrectly given as 2022KYCX032Z. It should be 62171321.
{"title":"Erratum: Integration graph attention network and multi-centre constrained loss for cross-modality person re-identification","authors":"","doi":"10.1049/cvi2.12210","DOIUrl":"https://doi.org/10.1049/cvi2.12210","url":null,"abstract":"<p>The authors wish to bring to the readers' attention the following errors in the article by He, D., et al.: Integration graph attention network and multi-centre constrained loss for cross-modality person re-identification [<span>1</span>].</p><p>In Funding Information section the funding number for National Natural Science Foundation of China is incorrectly mentioned as 2022KYCX032Z. It should be 62171321.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 6","pages":"722"},"PeriodicalIF":1.7,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12210","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50127724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Cao, Jianqiang Yin, Yanan Guo, Kangning Du, Fan Zhang
Sketch face recognition has a wide range of applications in criminal investigation, but it remains a challenging task due to small-scale samples and the semantic deficiencies caused by cross-modality differences. The authors propose a light semantic Transformer network to extract and model the semantic information of cross-modality images. First, the authors employ a meta-learning training strategy to obtain task-related training samples, addressing the small-sample problem. Then, to resolve the contradiction between the high complexity of the Transformer and the small-sample problem of sketch face recognition, the authors build the light semantic Transformer network by proposing a hierarchical group linear transformation and introducing parameter sharing, which can extract highly discriminative semantic features on small-scale datasets. Finally, the authors propose a domain-adaptive focal loss to reduce the cross-modality differences between sketches and photos and improve the training of the light semantic Transformer network. Extensive experiments show that the features extracted by the proposed method are highly discriminative. The authors' method improves the recognition rate by 7.6% on the UoM-SGFSv2 dataset, and the recognition rate reaches 92.59% on the CUFSF dataset.
{"title":"Sketch face recognition based on light semantic Transformer network","authors":"Lin Cao, Jianqiang Yin, Yanan Guo, Kangning Du, Fan Zhang","doi":"10.1049/cvi2.12209","DOIUrl":"10.1049/cvi2.12209","url":null,"abstract":"<p>Sketch face recognition has a wide range of applications in criminal investigation, but it remains a challenging task due to the small-scale sample and the semantic deficiencies caused by cross-modality differences. The authors propose a light semantic Transformer network to extract and model the semantic information of cross-modality images. First, the authors employ a meta-learning training strategy to obtain task-related training samples to solve the small sample problem. Then to solve the contradiction between the high complexity of the Transformer and the small sample problem of sketch face recognition, the authors build the light semantic transformer network by proposing a hierarchical group linear transformation and introducing parameter sharing, which can extract highly discriminative semantic features on small–scale datasets. Finally, the authors propose a domain-adaptive focal loss to reduce the cross-modality differences between sketches and photos and improve the training effect of the light semantic Transformer network. Extensive experiments have shown that the features extracted by the proposed method have significant discriminative effects. The authors’ method improves the recognition rate by 7.6% on the UoM-SGFSv2 dataset, and the recognition rate reaches 92.59% on the CUFSF dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 8","pages":"962-976"},"PeriodicalIF":1.7,"publicationDate":"2023-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ietresearch.onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12209","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135641694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}