Image-text retrieval is a fundamental yet challenging task that aims to bridge the semantic gap between heterogeneous data and achieve precise measurement of semantic similarity. Fine-grained alignment between cross-modal features plays a key role in many successful methods. Nevertheless, existing methods cannot effectively utilise intra-modal information to enhance feature representation and lack powerful similarity reasoning to obtain a precise similarity score. To tackle these issues, a context-aware Relation Enhancement and Similarity Reasoning model, called RESR, is proposed, which conducts both intra-modal relation enhancement and inter-modal similarity reasoning while considering global-context information. For intra-modal relation enhancement, a novel context-aware graph convolutional network is introduced to enhance local feature representations by utilising relation and global-context information. For inter-modal similarity reasoning, local and global similarity features are exploited through bidirectional alignment of image and text, and similarity reasoning is implemented among multi-granularity similarity features. Finally, the refined local and global similarity features are adaptively fused to obtain a precise similarity score. Experimental results show that the model outperforms state-of-the-art approaches, achieving average improvements of 2.5% and 6.3% in R@sum on the Flickr30K and MS-COCO datasets, respectively.
{"title":"Context-aware relation enhancement and similarity reasoning for image-text retrieval","authors":"Zheng Cui, Yongli Hu, Yanfeng Sun, Baocai Yin","doi":"10.1049/cvi2.12270","DOIUrl":"10.1049/cvi2.12270","url":null,"abstract":"<p>Image-text retrieval is a fundamental yet challenging task, which aims to bridge a semantic gap between heterogeneous data to achieve precise measurements of semantic similarity. The technique of fine-grained alignment between cross-modal features plays a key role in various successful methods that have been proposed. Nevertheless, existing methods cannot effectively utilise intra-modal information to enhance feature representation and lack powerful similarity reasoning to get a precise similarity score. Intending to tackle these issues, a context-aware Relation Enhancement and Similarity Reasoning model, called RESR, is proposed, which conducts both intra-modal relation enhancement and inter-modal similarity reasoning while considering the global-context information. For intra-modal relation enhancement, a novel context-aware graph convolutional network is introduced to enhance local feature representations by utilising relation and global-context information. For inter-modal similarity reasoning, local and global similarity features are exploited by the bidirectional alignment of image and text, and the similarity reasoning is implemented among multi-granularity similarity features. Finally, refined local and global similarity features are adaptively fused to get a precise similarity score. The experimental results show that our effective model outperforms some state-of-the-art approaches, achieving average improvements of 2.5% and 6.3% in R@sum on the Flickr30K and MS-COCO dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"652-665"},"PeriodicalIF":1.5,"publicationDate":"2024-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12270","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140483593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. OmDet, a novel language-aware object detection architecture, and an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training are introduced. Leveraging natural language as a universal knowledge representation, OmDet accumulates “visual vocabularies” from diverse datasets, unifying the task as a language-conditioned detection framework. The multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalises to numerous training datasets without manual label taxonomy merging. The authors demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of the authors' deep fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing.
{"title":"OmDet: Large-scale vision-language multi-dataset pre-training with multimodal detection network","authors":"Tiancheng Zhao, Peng Liu, Kyusong Lee","doi":"10.1049/cvi2.12268","DOIUrl":"10.1049/cvi2.12268","url":null,"abstract":"<p>The advancement of object detection (OD) in open-vocabulary and open-world scenarios is a critical challenge in computer vision. OmDet, a novel language-aware object detection architecture and an innovative training mechanism that harnesses continual learning and multi-dataset vision-language pre-training is introduced. Leveraging natural language as a universal knowledge representation, OmDet accumulates “visual vocabularies” from diverse datasets, unifying the task as a language-conditioned detection framework. The multimodal detection network (MDN) overcomes the challenges of multi-dataset joint training and generalizes to numerous training datasets without manual label taxonomy merging. The authors demonstrate superior performance of OmDet over strong baselines in object detection in the wild, open-vocabulary detection, and phrase grounding, achieving state-of-the-art results. Ablation studies reveal the impact of scaling the pre-training visual vocabulary, indicating a promising direction for further expansion to larger datasets. The effectiveness of our deep fusion approach is underscored by its ability to learn jointly from multiple datasets, enhancing performance through knowledge sharing.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"626-639"},"PeriodicalIF":1.5,"publicationDate":"2024-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12268","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139601188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D object detection technology from point clouds has been widely applied in the field of automatic driving in recent years. In practical applications, the shape point clouds of some objects are incomplete due to occlusion or far distance, which means they suffer from insufficient structural information. This greatly affects the detection performance. To address this challenge, the authors design a Structural Information Augment (SIA) Network for 3D object detection, named SIANet. Specifically, the authors design a SIA module to reconstruct the complete shapes of objects within proposals for enhancing their geometric features, which are further fused into the spatial feature of the object for box refinement to predict accurate detection boxes. Besides, the authors construct a novel UNet-like Context-enhanced Transformer backbone network, which stacks Context-enhanced Transformer modules and an upsampling branch to capture contextual information efficiently and generate high-quality proposals for the SIA module. Extensive experiments show that the authors’ well-designed SIANet can effectively improve detection performance, especially surpassing the baseline network by a 1.04% mean Average Precision (mAP) gain on the KITTI dataset and a 0.75% LEVEL_2 mAP gain on the Waymo dataset.
{"title":"SIANet: 3D object detection with structural information augment network","authors":"Jing Zhou, Tengxing Lin, Zixin Gong, Xinhan Huang","doi":"10.1049/cvi2.12272","DOIUrl":"10.1049/cvi2.12272","url":null,"abstract":"<p>3D object detection technology from point clouds has been widely applied in the field of automatic driving in recent years. In practical applications, the shape point clouds of some objects are incomplete due to occlusion or far distance, which means they suffer from insufficient structural information. This greatly affects the detection performance. To address this challenge, the authors design a Structural Information Augment (SIA) Network for 3D object detection, named SIANet. Specifically, the authors design a SIA module to reconstruct the complete shapes of objects within proposals for enhancing their geometric features, which are further fused into the spatial feature of the object for box refinement to predict accurate detection boxes. Besides, the authors construct a novel Unet-liked Context-enhanced Transformer backbone network, which stacks Context-enhanced Transformer modules and an upsampling branch to capture contextual information efficiently and generate high-quality proposals for the SIA module. Extensive experiments show that the authors’ well-designed SIANet can effectively improve detection performance, especially surpassing the baseline network by 1.04% mean Average Precision (mAP) gain in the KITTI dataset and 0.75% LEVEL_2 mAP gain in the Waymo dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"682-695"},"PeriodicalIF":1.5,"publicationDate":"2024-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12272","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139604878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent studies have demonstrated that finely tuned deep neural networks (DNNs) are susceptible to adversarial attacks. Conventional physical attacks employ stickers as perturbations, achieving robust adversarial effects but compromising stealthiness. Recent innovations utilise light beams, such as lasers and projectors, for perturbation generation, allowing for stealthy physical attacks at the expense of robustness. In pursuit of implementing both stealthy and robust physical attacks, the authors present an adversarial catoptric light (AdvCL). This method leverages the natural phenomenon of catoptric light to generate perturbations that are both natural and stealthy. AdvCL first formalises the physical parameters of catoptric light and then optimises these parameters using a genetic algorithm to derive the most adversarial perturbation. Finally, the perturbations are deployed in the physical scene to execute stealthy and robust attacks. The proposed method is evaluated across three dimensions: effectiveness, stealthiness, and robustness. Quantitative results obtained in simulated environments demonstrate the efficacy of the proposed method, achieving an attack success rate of 83.5%, surpassing the baseline. The authors utilise common catoptric light as a perturbation to enhance the method's stealthiness, rendering physical samples more natural in appearance. Robustness is affirmed by successfully attacking advanced DNNs with a success rate exceeding 80% in all cases. Additionally, the authors discuss defence strategies against AdvCL and introduce some light-based physical attacks.
{"title":"Adversarial catoptric light: An effective, stealthy and robust physical-world attack to DNNs","authors":"Chengyin Hu, Weiwen Shi, Ling Tian, Wen Li","doi":"10.1049/cvi2.12264","DOIUrl":"10.1049/cvi2.12264","url":null,"abstract":"<p>Recent studies have demonstrated that finely tuned deep neural networks (DNNs) are susceptible to adversarial attacks. Conventional physical attacks employ stickers as perturbations, achieving robust adversarial effects but compromising stealthiness. Recent innovations utilise light beams, such as lasers and projectors, for perturbation generation, allowing for stealthy physical attacks at the expense of robustness. In pursuit of implementing both stealthy and robust physical attacks, the authors present an adversarial catoptric light (AdvCL). This method leverages the natural phenomenon of catoptric light to generate perturbations that are both natural and stealthy. AdvCL first formalises the physical parameters of catoptric light and then optimises these parameters using a genetic algorithm to derive the most adversarial perturbation. Finally, the perturbations are deployed in the physical scene to execute stealthy and robust attacks. The proposed method is evaluated across three dimensions: effectiveness, stealthiness, and robustness. Quantitative results obtained in simulated environments demonstrate the efficacy of the proposed method, achieving an attack success rate of 83.5%, surpassing the baseline. The authors utilise common catoptric light as a perturbation to enhance the method's stealthiness, rendering physical samples more natural in appearance. Robustness is affirmed by successfully attacking advanced DNNs with a success rate exceeding 80% in all cases. Additionally, the authors discuss defence strategies against AdvCL and introduce some light-based physical attacks.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"557-573"},"PeriodicalIF":1.5,"publicationDate":"2024-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12264","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139614963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The multifaceted nature of sensor data has long been a hurdle for those seeking to harness its full potential in the field of 3D object detection. Although the utilisation of point clouds as input has yielded exceptional results, the challenge of effectively combining the complementary properties of multi-sensor data looms large. This work presents a new approach to multi-model 3D object detection, called adaptive voxel-image feature fusion (AVIFF). Adaptive voxel-image feature fusion is an end-to-end single-shot framework that can dynamically and adaptively fuse point cloud and image features, resulting in a more comprehensive and integrated analysis of camera and LiDAR sensor data. With the aid of the adaptive feature fusion module, spatialised image features can be adroitly fused with voxel-based point cloud features, while the Dense Fusion module ensures the preservation of the distinctive characteristics of 3D point cloud data through the use of a heterogeneous architecture. Notably, the authors’ framework features a novel generalised intersection over union loss function that enhances the perceptibility of object localisation and rotation in 3D space. Comprehensive experimentation has validated the efficacy of the authors’ proposed modules, firmly establishing AVIFF as a novel framework in the field of 3D object detection.
{"title":"A novel multi-model 3D object detection framework with adaptive voxel-image feature fusion","authors":"Zhao Liu, Zhongliang Fu, Gang Li, Shengyuan Zhang","doi":"10.1049/cvi2.12269","DOIUrl":"10.1049/cvi2.12269","url":null,"abstract":"<p>The multifaceted nature of sensor data has long been a hurdle for those seeking to harness its full potential in the field of 3D object detection. Although the utilisation of point clouds as input has yielded exceptional results, the challenge of effectively combining the complementary properties of multi-sensor data looms large. This work presents a new approach to multi-model 3D object detection, called adaptive voxel-image feature fusion (AVIFF). Adaptive voxel-image feature fusion is an end-to-end single-shot framework that can dynamically and adaptively fuse point cloud and image features, resulting in a more comprehensive and integrated analysis of the camera sensor and the LiDar sensor data. With the aid of the adaptive feature fusion module, spatialised image features can be adroitly fused with voxel-based point cloud features, while the Dense Fusion module ensures the preservation of the distinctive characteristics of 3D point cloud data through the use of a heterogeneous architecture. Notably, the authors’ framework features a novel generalised intersection over union loss function that enhances the perceptibility of object localsation and rotation in 3D space. Comprehensive experimentation has validated the efficacy of the authors’ proposed modules, firmly establishing AVIFF as a novel framework in the field of 3D object detection.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"640-651"},"PeriodicalIF":1.5,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12269","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139616930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
X-ray security checks aim to detect contraband in luggage; however, the detection accuracy is hindered by the overlapping and significant size differences of objects in X-ray images. To address these challenges, the authors introduce a novel network model named Multi-Scale Feature Attention (MSFA)-DEtection TRansformer (DETR). Firstly, the pyramid feature extraction structure is embedded into the self-attention module, referred to as the MSFA. Leveraging the MSFA module, MSFA-DETR extracts multi-scale feature information and amalgamates it into high-level semantic features. Subsequently, these features are synergised through attention mechanisms to capture correlations between global information and multi-scale features. MSFA significantly bolsters the model's robustness across different sizes, thereby enhancing detection accuracy. Simultaneously, a new initialisation method for object queries is proposed. The authors’ foreground sequence extraction (FSE) module extracts key feature sequences from feature maps, serving as prior knowledge for object queries. FSE expedites the convergence of the DETR model and elevates detection accuracy. Extensive experimentation validates that this proposed model surpasses state-of-the-art methods on the CLCXray and PIDray datasets.
{"title":"Multi-Scale Feature Attention-DEtection TRansformer: Multi-Scale Feature Attention for security check object detection","authors":"Haifeng Sima, Bailiang Chen, Chaosheng Tang, Yudong Zhang, Junding Sun","doi":"10.1049/cvi2.12267","DOIUrl":"10.1049/cvi2.12267","url":null,"abstract":"<p>X-ray security checks aim to detect contraband in luggage; however, the detection accuracy is hindered by the overlapping and significant size differences of objects in X-ray images. To address these challenges, the authors introduce a novel network model named Multi-Scale Feature Attention (MSFA)-DEtection TRansformer (DETR). Firstly, the pyramid feature extraction structure is embedded into the self-attention module, referred to as the MSFA. Leveraging the MSFA module, MSFA-DETR extracts multi-scale feature information and amalgamates them into high-level semantic features. Subsequently, these features are synergised through attention mechanisms to capture correlations between global information and multi-scale features. MSFA significantly bolsters the model's robustness across different sizes, thereby enhancing detection accuracy. Simultaneously, A new initialisation method for object queries is proposed. The authors’ foreground sequence extraction (FSE) module extracts key feature sequences from feature maps, serving as prior knowledge for object queries. FSE expedites the convergence of the DETR model and elevates detection accuracy. Extensive experimentation validates that this proposed model surpasses state-of-the-art methods on the CLCXray and PIDray datasets.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"613-625"},"PeriodicalIF":1.5,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12267","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139620312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial training suffers from poor effectiveness due to the challenging optimisation of loss with hard labels. To address this issue, adversarial distillation has emerged as a potential solution, encouraging target models to mimic the output of the teachers. However, reliance on pre-training teachers leads to additional training costs and raises concerns about the reliability of their knowledge. Furthermore, existing methods fail to consider the significant differences in unconfident samples between early and late stages, potentially resulting in robust overfitting. An adversarial defence method named Clean, Performance-robust, and Performance-sensitive Historical Information based Adversarial Self-Distillation (CPr&PsHI-ASD) is presented. Firstly, an adversarial self-distillation replacement method based on clean, performance-robust, and performance-sensitive historical information is developed to eliminate pre-training costs and enhance guidance reliability for the target model. Secondly, adversarial self-distillation algorithms that leverage knowledge distilled from the previous iteration are introduced to facilitate the self-distillation of adversarial knowledge and mitigate the problem of robust overfitting. Experiments are conducted to evaluate the performance of the proposed method on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The results demonstrate that the CPr&PsHI-ASD method is more effective than existing adversarial distillation methods in enhancing adversarial robustness and mitigating robust overfitting issues against various adversarial attacks.
{"title":"Clean, performance-robust, and performance-sensitive historical information based adversarial self-distillation","authors":"Shuyi Li, Hongchao Hu, Shumin Huo, Hao Liang","doi":"10.1049/cvi2.12265","DOIUrl":"10.1049/cvi2.12265","url":null,"abstract":"<p>Adversarial training suffers from poor effectiveness due to the challenging optimisation of loss with hard labels. To address this issue, adversarial distillation has emerged as a potential solution, encouraging target models to mimic the output of the teachers. However, reliance on pre-training teachers leads to additional training costs and raises concerns about the reliability of their knowledge. Furthermore, existing methods fail to consider the significant differences in unconfident samples between early and late stages, potentially resulting in robust overfitting. An adversarial defence method named Clean, Performance-robust, and Performance-sensitive Historical Information based Adversarial Self-Distillation (CPr & PsHI-ASD) is presented. Firstly, an adversarial self-distillation replacement method based on clean, performance-robust, and performance-sensitive historical information is developed to eliminate pre-training costs and enhance guidance reliability for the target model. Secondly, adversarial self-distillation algorithms that leverage knowledge distilled from the previous iteration are introduced to facilitate the self-distillation of adversarial knowledge and mitigate the problem of robust overfitting. Experiments are conducted to evaluate the performance of the proposed method on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. The results demonstrate that the CPr&PsHI-ASD method is more effective than existing adversarial distillation methods in enhancing adversarial robustness and mitigating robust overfitting issues against various adversarial attacks.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"591-612"},"PeriodicalIF":1.5,"publicationDate":"2024-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12265","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139446540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In response to the challenges of Multi-Object Tracking (MOT) in sports scenes, such as severe occlusions, similar appearances, drastic pose changes, and complex motion patterns, a deep-learning framework, CTGMOT (CNN-Transformer-GNN-based MOT), is proposed specifically for multi-athlete tracking in sports videos; it jointly models detection, appearance, and motion features. Firstly, a detection network that combines Convolutional Neural Networks (CNNs) and Transformers is constructed to extract both local and global features from images. The fusion of appearance and motion features is achieved through a design of parallel dual-branch decoders. Secondly, graph models are built using Graph Neural Networks (GNNs) to accurately capture the spatio-temporal correlations between object and trajectory features from inter-frame and intra-frame associations. Experimental results on the public sports tracking dataset SportsMOT show that the proposed framework outperforms other state-of-the-art MOT methods in complex sports scenes. In addition, the proposed framework shows excellent generality on the benchmark datasets MOT17 and MOT20.
{"title":"A deep learning framework for multi-object tracking in team sports videos","authors":"Wei Cao, Xiaoyong Wang, Xianxiang Liu, Yishuai Xu","doi":"10.1049/cvi2.12266","DOIUrl":"10.1049/cvi2.12266","url":null,"abstract":"<p>In response to the challenges of Multi-Object Tracking (MOT) in sports scenes, such as severe occlusions, similar appearances, drastic pose changes, and complex motion patterns, a deep-learning framework CTGMOT (CNN-Transformer-GNN-based MOT) specifically for multiple athlete tracking in sports videos that performs joint modelling of detection, appearance and motion features is proposed. Firstly, a detection network that combines Convolutional Neural Networks (CNN) and Transformers is constructed to extract both local and global features from images. The fusion of appearance and motion features is achieved through a design of parallel dual-branch decoders. Secondly, graph models are built using Graph Neural Networks (GNN) to accurately capture the spatio-temporal correlations between object and trajectory features from inter-frame and intra-frame associations. Experimental results on the public sports tracking dataset SportsMOT show that the proposed framework outperforms other state-of-the-art methods for MOT in complex sport scenes. In addition, the proposed framework shows excellent generality on benchmark datasets MOT17 and MOT20.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 5","pages":"574-590"},"PeriodicalIF":1.5,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12266","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139453061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, the offline-trained Siamese pipeline has drawn wide attention due to its outstanding tracking performance. However, existing Siamese trackers rely on offline training to extract ‘universal’ features, which is insufficient to effectively distinguish the target from fluctuating interference when embedding the information of the two branches, leading to inaccurate classification and localisation. In addition, Siamese trackers employ a pre-defined scale for cropping the search candidate region based on the previous frame's result, which can easily introduce redundant background noise (clutter, similar objects etc.), affecting the tracker's robustness. To solve these problems, the authors propose two novel sub-networks for spatial feature embedding for robust object tracking. Specifically, the proposed spatial remapping (SRM) network enhances the feature discrepancy between target and distractor categories by online remapping and improves the discriminative ability of the tracker in the embedding space. MAML is used to optimise the SRM network to ensure its adaptability to complex tracking scenarios. Moreover, a temporal information proposal-guided (TPG) network is introduced that utilises a GRU model to dynamically predict the search scale based on temporal motion states, thereby reducing potential background interference. The two proposed networks are integrated into two popular trackers, SiamFC++ and TransT, denoted SiamSRMC and SiamSRMT, respectively, which achieve superior performance on six challenging benchmarks: OTB100, VOT2019, UAV123, GOT10K, TrackingNet and LaSOT. Moreover, the proposed trackers obtain competitive performance compared with state-of-the-art trackers on the background-clutter and similar-object attributes, validating the effectiveness of the method.
{"title":"Spatial feature embedding for robust visual object tracking","authors":"Kang Liu, Long Liu, Shangqi Yang, Zhihao Fu","doi":"10.1049/cvi2.12263","DOIUrl":"10.1049/cvi2.12263","url":null,"abstract":"<p>Recently, the offline-trained Siamese pipeline has drawn wide attention due to its outstanding tracking performance. However, the existing Siamese trackers utilise offline training to extract ‘universal’ features, which is insufficient to effectively distinguish between the target and fluctuating interference in embedding the information of the two branches, leading to inaccurate classification and localisation. In addition, the Siamese trackers employ a pre-defined scale for cropping the search candidate region based on the previous frame's result, which might easily introduce redundant background noise (clutter, similar objects etc.), affecting the tracker's robustness. To solve these problems, the authors propose two novel sub-network spatial employed to spatial feature embedding for robust object tracking. Specifically, the proposed spatial remapping (SRM) network enhances the feature discrepancy between target and distractor categories by online remapping, and improves the discriminant ability of the tracker on the embedding space. The MAML is used to optimise the SRM network to ensure its adaptability to complex tracking scenarios. Moreover, a temporal information proposal-guided (TPG) network that utilises a GRU model to dynamically predict the search scale based on temporal motion states to reduce potential background interference is introduced. The proposed two network is integrated into two popular trackers, namely SiamFC++ and TransT, which achieve superior performance on six challenging benchmarks, including OTB100, VOT2019, UAV123, GOT10K, TrackingNet and LaSOT, TrackingNet and LaSOT denoting them as SiamSRMC and SiamSRMT, respectively. Moreover, the proposed trackers obtain competitive tracking performance compared with the state-of-the-art trackers in the attribute of background clutter and similar object, validating the effectiveness of our method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"540-556"},"PeriodicalIF":1.7,"publicationDate":"2023-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12263","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138954945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, many methods for image super-resolution (SR) have relied on pairs of low-resolution (LR) and high-resolution (HR) images for training, where the degradation process is predefined by bicubic downsampling. While such approaches perform well in standard benchmark tests, they often fail to accurately replicate the complexity of real-world image degradation. To address this challenge, researchers have proposed the use of unpaired image training to implicitly model the degradation process. However, there is a significant domain gap between real-world LR images and synthetic LR images derived from HR, which severely degrades SR performance. A novel unsupervised image-blind super-resolution method that exploits degradation feature-based learning for real-image super-resolution reconstruction (RDFL) is proposed. The approach learns the degradation process from HR to LR using a generative adversarial network (GAN) and constrains the data distribution of the synthetic LR images with real degraded images. The authors then encode the degraded features into a Transformer-based SR network for image super-resolution reconstruction through degradation representation learning. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness and superiority of the RDFL method, which achieves visually pleasing reconstruction results.
{"title":"Unsupervised image blind super resolution via real degradation feature learning","authors":"Cheng Yang, Guanming Lu","doi":"10.1049/cvi2.12262","DOIUrl":"10.1049/cvi2.12262","url":null,"abstract":"<p>In recent years, many methods for image super-resolution (SR) have relied on pairs of low-resolution (LR) and high-resolution (HR) images for training, where the degradation process is predefined by bicubic downsampling. While such approaches perform well in standard benchmark tests, they often fail to accurately replicate the complexity of real-world image degradation. To address this challenge, researchers have proposed the use of unpaired image training to implicitly model the degradation process. However, there is a significant domain gap between the real-world LR and the synthetic LR images from HR, which severely degrades the SR performance. A novel unsupervised image-blind super-resolution method that exploits degradation feature-based learning for real-image super-resolution reconstruction (RDFL) is proposed. Their approach learns the degradation process from HR to LR using a generative adversarial network (GAN) and constrains the data distribution of the synthetic LR with real degraded images. The authors then encode the degraded features into a Transformer-based SR network for image super-resolution reconstruction through degradation representation learning. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness and superiority of the RDFL method, which achieves visually pleasing reconstruction results.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"485-498"},"PeriodicalIF":1.7,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12262","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139001043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}