Hsin-Chun Tsai, Nan-Han Lu, Kuo-Ying Liu, Chuan-Han Lin, Jhing-Fa Wang
Convolutional deep learning models have shown performance comparable to that of radiologists in detecting and classifying thoracic diseases. However, research on rib fractures remains limited compared with other thoracic abnormalities. Moreover, existing deep learning models primarily focus on frontal chest X-ray (CXR) images. To address these gaps, the authors utilised the EDARib-CXR dataset, comprising 369 frontal and 829 oblique CXRs annotated by experienced radiologists, who marked the presence of rib fractures with bounding-box-level annotations. The authors introduce two detection models, AB-YOLOv5 and PB-YOLOv5, and train and evaluate them on EDARib-CXR. AB-YOLOv5 is a modified YOLOv5 network that incorporates an auxiliary branch to enhance the resolution of feature maps in the final convolutional layer. PB-YOLOv5, on the other hand, keeps the structure of the original YOLOv5 but is trained on image patches so that features of small objects are preserved in downsampled images. Furthermore, the authors propose a novel two-level cascaded architecture that integrates both AB-YOLOv5 and PB-YOLOv5 detection models. This structure demonstrates improved metrics on the test set, achieving an AP30 score of 0.785. Consequently, the study develops deep learning-based detectors capable of identifying and localising fractured ribs in both frontal and oblique CXR images.
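As a rough illustration of the patch-based training idea behind PB-YOLOv5, the sketch below tiles a large radiograph into overlapping crops so that small fracture regions survive downsampling. The patch size, overlap, and helper names are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal sketch, assuming 640-px patches with 128-px overlap: tile a large
# CXR into overlapping crops so small fracture regions are not lost when the
# detector downsamples its input.
import numpy as np

def _starts(length: int, patch: int, step: int):
    """Start offsets covering [0, length), with the last patch aligned to the edge."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, step))
    starts.append(length - patch)
    return starts

def make_patches(image: np.ndarray, patch: int = 640, overlap: int = 128):
    """Yield (y, x, crop) tuples of overlapping square crops."""
    step = patch - overlap
    h, w = image.shape[:2]
    for y in _starts(h, patch, step):
        for x in _starts(w, patch, step):
            yield y, x, image[y:y + patch, x:x + patch]

cxr = np.zeros((2048, 2500), dtype=np.uint8)   # dummy full-resolution CXR
crops = list(make_patches(cxr))
print(len(crops), crops[0][2].shape)           # number of crops and the shape of one crop
```

Each crop would then be fed to the detector at training time; predictions on crops can be mapped back to full-image coordinates using the stored (y, x) offsets.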
{"title":"Cascading AB-YOLOv5 and PB-YOLOv5 for rib fracture detection in frontal and oblique chest X-ray images","authors":"Hsin-Chun Tsai, Nan-Han Lu, Kuo-Ying Liu, Chuan-Han Lin, Jhing-Fa Wang","doi":"10.1049/cvi2.12239","DOIUrl":"https://doi.org/10.1049/cvi2.12239","url":null,"abstract":"<p>Convolutional deep learning models have shown comparable performance to radiologists in detecting and classifying thoracic diseases. However, research on rib fractures remains limited compared to other thoracic abnormalities. Moreover, existing deep learning models primarily focus on using frontal chest X-ray (CXR) images. To address these gaps, the authors utilised the EDARib-CXR dataset, comprising 369 frontal and 829 oblique CXRs. These X-rays were annotated by experienced radiologists, specifically identifying the presence of rib fractures using bounding-box-level annotations. The authors introduce two detection models, AB-YOLOv5 and PB-YOLOv5, and train and evaluate them on the EDARib-CXR dataset. AB-YOLOv5 is a modified YOLOv5 network that incorporates an auxiliary branch to enhance the resolution of feature maps in the final convolutional network layer. On the other hand, PB-YOLOv5 maintains the same structure as the original YOLOv5 but employs image patches during training to preserve features of small objects in downsampled images. Furthermore, the authors propose a novel two-level cascaded architecture that integrates both AB-YOLOv5 and PB-YOLOv5 detection models. This structure demonstrates improved metrics on the test set, achieving an AP30 score of 0.785. Consequently, the study successfully develops deep learning-based detectors capable of identifying and localising fractured ribs in both frontal and oblique CXR images.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 7","pages":"750-762"},"PeriodicalIF":1.7,"publicationDate":"2023-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12239","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50146880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhijia Zhang, Yiming Shao, Ligang Wang, Haixing Li, Yunpeng Liu
Text in natural scenes conveys rich semantic information that helps people understand and analyse everyday scenes. This paper focuses on the problems of discrete spatial distribution and variable geometric size of text in natural scenes with complex backgrounds, and proposes an end-to-end natural scene text detection method based on DBNet. The authors first use IResNet as the backbone network, which retains more text features without increasing the number of network parameters. Furthermore, a Transformer-based module is introduced in the feature extraction stage to strengthen the correlation between high-level feature pixels. Then, the authors add a spatial pyramid pooling structure at the end of feature extraction, which combines local and global features, enriches the expressive ability of the feature maps, and alleviates the detection limitations caused by the geometric size of text. Finally, to better integrate the features of each level, a dual attention module is embedded after multi-scale feature fusion. Extensive experiments are conducted on the MSRA-TD500, CTW1500, ICDAR2015, and MLT2017 datasets. The results show that IDBNet improves the average precision, recall, and F-measure of text detection compared with state-of-the-art methods and has higher predictive ability and practicability.
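As a loose illustration of the spatial pyramid pooling step described above, the following PyTorch sketch pools the final feature map at several scales and concatenates local and global context. The kernel sizes and the module name `SPPBlock` are assumptions, not the exact IDBNet configuration.

```python
# Minimal sketch of spatial pyramid pooling: pool the final feature map at
# several scales and concatenate the results so local and global context mix.
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self, channels: int, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes
        )
        # 1x1 conv squeezes the concatenated maps back to the input width
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(pooled, dim=1))

feat = torch.randn(1, 256, 40, 40)      # dummy backbone output
print(SPPBlock(256)(feat).shape)        # torch.Size([1, 256, 40, 40])
```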
{"title":"IDBNet: Improved differentiable binarisation network for natural scene text detection","authors":"Zhijia Zhang, Yiming Shao, Ligang Wang, Haixing Li, Yunpeng Liu","doi":"10.1049/cvi2.12241","DOIUrl":"10.1049/cvi2.12241","url":null,"abstract":"<p>The text in the natural scene can express rich semantic information, which helps people understand and analyse daily things. This paper focuses on the problems of discrete text spatial distribution and variable text geometric size in natural scenes with complex backgrounds and proposes an end-to-end natural scene text detection method based on DBNet. The authors first use IResNet as the backbone network, which does not increase network parameters while retaining more text features. Furthermore, a module with Transformer is introduced in the feature extraction stage to strengthen the correlation between high-level feature pixels. Then, the authors add a spatial pyramid pooling structure in the end of feature extraction, which realises the combination of local and global features, enriches the expressive ability of feature maps, and alleviates the detection limitations caused by the geometric size of features. Finally, to better integrate the features of each level, a dual attention module is embedded after multi-scale feature fusion. Extensive experiments on the MSRA-TD500, CTW1500, ICDAR2015, and MLT2017 data set are conducted. The results showed that IDBNet can improve the average precision, recall, and F-measure of a text compared with the state of art text detection methods and has higher predictive ability and practicability.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 2","pages":"224-235"},"PeriodicalIF":1.7,"publicationDate":"2023-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12241","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135425628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The position of vehicles is determined using an algorithm that comprises two stages: detection and prediction. The more frames in which the detection network is used, the more accurate the detector is; the more the prediction network is used, the faster the algorithm runs. The algorithm is therefore very flexible in achieving the required accuracy and speed. The YOLO-based detection network is designed to be robust against vehicle scale changes. In addition, feature maps produced in the detector network contribute greatly to increasing the accuracy of the detector. In these maps, using differential images and a U-Net-based module, the image is segmented into two classes: vehicle and background. To increase the accuracy of the recursive predictive network, vehicle manoeuvres are classified. For this purpose, the spatial and temporal information of the vehicles is considered simultaneously. This classifier is much more effective than classifiers that consider spatial and temporal information separately. Experiments on the Highway and UA-DETRAC datasets demonstrate the performance of the proposed algorithm in urban traffic monitoring systems.
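To make the detection/prediction scheduling concrete, here is a minimal sketch of a loop that runs a heavy detector every few frames and a fast trajectory predictor in between. The names `detect`, `predict`, and `detect_every` are placeholders, not the paper's actual components.

```python
# Minimal sketch, assuming a placeholder detector and predictor: the
# `detect_every` knob trades accuracy (more detections) for speed (more predictions).
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]             # x, y, w, h of one vehicle

def track_video(frames: List,
                detect: Callable[[object], List[Box]],
                predict: Callable[[List[List[Box]]], List[Box]],
                detect_every: int = 5) -> List[List[Box]]:
    """Run the accurate detector every `detect_every` frames, the fast predictor otherwise."""
    history: List[List[Box]] = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0 or not history:
            boxes = detect(frame)                    # slow but accurate detection stage
        else:
            boxes = predict(history)                 # fast extrapolation from past positions
        history.append(boxes)
    return history

# Example with trivial stand-ins for the two networks:
dummy_frames = [None] * 12
positions = track_video(dummy_frames,
                        detect=lambda f: [(10.0, 20.0, 50.0, 30.0)],
                        predict=lambda h: h[-1])
print(len(positions))                                # 12
```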
{"title":"Real-time vehicle detection using segmentation-based detection network and trajectory prediction","authors":"Nafiseh Zarei, Payman Moallem, Mohammadreza Shams","doi":"10.1049/cvi2.12236","DOIUrl":"10.1049/cvi2.12236","url":null,"abstract":"<p>The position of vehicles is determined using an algorithm that includes two stages of detection and prediction. The more the number of frames in which the detection network is used, the more accurate the detector is, and the more the prediction network is used, the algorithm is faster. Therefore, the algorithm is very flexible to achieve the required accuracy and speed. YOLO's base detection network is designed to be robust against vehicle scale changes. Also, feature maps are produced in the detector network, which contribute greatly to increasing the accuracy of the detector. In these maps, using differential images and a u-net-based module, image segmentation has been done into two classes: vehicle and background. To increase the accuracy of the recursive predictive network, vehicle manoeuvres are classified. For this purpose, the spatial and temporal information of the vehicles are considered simultaneously. This classifier is much more effective than classifiers that consider spatial and temporal information separately. The Highway and UA-DETRAC datasets demonstrate the performance of the proposed algorithm in urban traffic monitoring systems.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 2","pages":"191-209"},"PeriodicalIF":1.7,"publicationDate":"2023-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12236","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135864622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cuihong Xue, Jingli Jia, Ming Yu, Gang Yan, Yingchun Guo, Yuehao Liu
To address the problems that feature extractors lack strongly supervised training and that single-sequence models capture insufficient temporal information, a hierarchical sequence memory network with a multi-level iterative optimisation strategy is proposed for continuous sign language recognition. The method uses the spatial-temporal fusion convolution network (STFC-Net) to extract spatial-temporal information from RGB and optical-flow video frames and obtain the multi-modal visual features of a sign language video. Then, to enhance the temporal relationships of the visual feature maps, a hierarchical memory sequence network captures local utterance features and global context dependencies across the time dimension to obtain sequence features. Finally, the decoder produces the final sentence sequence. To strengthen the feature extractor, the authors adopt a multi-level iterative optimisation strategy to fine-tune STFC-Net and the utterance feature extractor. Experimental results on the RWTH-Phoenix-Weather multi-signer 2014 dataset and the Chinese sign language dataset show the effectiveness and superiority of this method.
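As a rough sketch of the two-stream fusion idea behind STFC-Net, the following PyTorch code encodes RGB and optical-flow clips with separate 3D-convolution stems and concatenates the results into one multi-modal feature. The channel sizes and the class name `TwoStreamFusion` are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of RGB + optical-flow fusion with tiny per-modality 3D-conv stems.
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    def __init__(self, out_dim: int = 512):
        super().__init__()
        def stem(in_ch: int):                       # tiny 3D-conv encoder per modality
            return nn.Sequential(nn.Conv3d(in_ch, 64, kernel_size=3, padding=1),
                                 nn.ReLU(),
                                 nn.AdaptiveAvgPool3d(1))
        self.rgb_stem = stem(3)
        self.flow_stem = stem(2)                    # optical flow has 2 channels (u, v)
        self.fuse = nn.Linear(128, out_dim)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        r = self.rgb_stem(rgb).flatten(1)           # (B, 64)
        f = self.flow_stem(flow).flatten(1)         # (B, 64)
        return self.fuse(torch.cat([r, f], dim=1))  # (B, out_dim) multi-modal feature

rgb = torch.randn(1, 3, 16, 112, 112)               # dummy RGB clip (B, C, T, H, W)
flow = torch.randn(1, 2, 16, 112, 112)              # dummy optical-flow clip
print(TwoStreamFusion()(rgb, flow).shape)            # torch.Size([1, 512])
```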
{"title":"Continuous sign language recognition based on hierarchical memory sequence network","authors":"Cuihong Xue, Jingli Jia, Ming Yu, Gang Yan, Yingchun Guo, Yuehao Liu","doi":"10.1049/cvi2.12240","DOIUrl":"10.1049/cvi2.12240","url":null,"abstract":"<p>With the goal of solving the problem of feature extractors lacking strong supervision training and insufficient time information concerning single-sequence model learning, a hierarchical sequence memory network with a multi-level iterative optimisation strategy is proposed for continuous sign language recognition. This method uses the spatial-temporal fusion convolution network (STFC-Net) to extract the spatial-temporal information of RGB and Optical flow video frames to obtain the multi-modal visual features of a sign language video. Then, in order to enhance the temporal relationships of visual feature maps, the hierarchical memory sequence network is used to capture local utterance features and global context dependencies across time dimensions to obtain sequence features. Finally, the decoder decodes the final sentence sequence. In order to enhance the feature extractor, the authors adopted a multi-level iterative optimisation strategy to fine-tune STFC-Net and the utterance feature extractor. The experimental results on the RWTH-Phoenix-Weather multi-signer 2014 dataset and the Chinese sign language dataset show the effectiveness and superiority of this method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 2","pages":"247-259"},"PeriodicalIF":1.7,"publicationDate":"2023-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12240","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136062314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, online optimisation-based scene flow estimation has attracted significant attention owing to its strong domain adaptivity. Although online optimisation-based methods have made significant advances, their performance is far from satisfactory because only flow priors are considered, neglecting the scene priors that are crucial for representing dynamic scenes. To address this problem, the authors introduce a dual-branch MLP-based architecture that encodes implicit scene representations from a source 3D point cloud and can additionally synthesise a target 3D point cloud. The mapping function between the source and synthesised target 3D point clouds thus serves as an extra implicit regulariser that captures scene priors. Moreover, the model infers both flow and scene priors in a stronger bidirectional manner, effectively establishing spatiotemporal constraints among the synthesised, source, and target 3D point clouds. Experiments on four challenging datasets, including KITTI scene flow, FlyingThings3D, Argoverse, and nuScenes, show that the proposed method achieves competitive results, demonstrating its effectiveness and generality.
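The dual-branch design can be pictured with the small PyTorch sketch below, in which a shared point-wise encoder feeds one head that regresses per-point flow and another that synthesises a target point cloud. Layer widths and names are assumptions, not the authors' exact architecture.

```python
# Minimal sketch, assuming a shared point-wise MLP encoder with two heads:
# one for per-point scene flow and one for synthesising a target point cloud.
import torch
import torch.nn as nn

class DualBranchMLP(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.flow_head = nn.Linear(hidden, 3)    # per-point flow vector
        self.synth_head = nn.Linear(hidden, 3)   # per-point displacement used to synthesise a target cloud

    def forward(self, source: torch.Tensor):
        h = self.encoder(source)                 # (N, 3) -> (N, hidden)
        flow = self.flow_head(h)
        synthesised_target = source + self.synth_head(h)
        return flow, synthesised_target

pts = torch.randn(2048, 3)                       # dummy source point cloud
flow, target = DualBranchMLP()(pts)
print(flow.shape, target.shape)                  # torch.Size([2048, 3]) twice
```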
{"title":"Scene flow estimation from 3D point clouds based on dual-branch implicit neural representations","authors":"Mingliang Zhai, Kang Ni, Jiucheng Xie, Hao Gao","doi":"10.1049/cvi2.12237","DOIUrl":"10.1049/cvi2.12237","url":null,"abstract":"<p>Recently, online optimisation-based scene flow estimation has attracted significant attention due to its strong domain adaptivity. Although online optimisation-based methods have made significant advances, the performance is far from satisfactory as only flow priors are considered, neglecting scene priors that are crucial for the representations of dynamic scenes. To address this problem, the authors introduce a dual-branch MLP-based architecture to encode implicit scene representations from a source 3D point cloud, which can additionally synthesise a target 3D point cloud. Thus, the mapping function between the source and synthesised target 3D point clouds is established as an extra implicit regulariser to capture scene priors. Moreover, their model infers both flow and scene priors in a stronger bidirectional manner. It can effectively establish spatiotemporal constraints among the synthesised, source, and target 3D point clouds. Experiments on four challenging datasets, including KITTI scene flow, FlyingThings3D, Argoverse, and nuScenes, show that our method can achieve potential and comparable results, proving its effectiveness and generality.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 2","pages":"210-223"},"PeriodicalIF":1.7,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12237","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135396850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Minchao Ye, Chenglong Wang, Zhihao Meng, Fengchao Xiong, Yuntao Qian
The small-sample-size problem is a persistent challenge for hyperspectral image (HSI) classification. Considering the co-occurrence of land-cover classes between similar scenes, transfer learning can be applied, and cross-scene classification has emerged in recent years as a feasible approach. In cross-scene classification, a source scene with sufficient labelled samples is used to assist the classification of a target scene that has only a few labelled samples. In most situations, different HSI scenes are imaged by different sensors and therefore have different input feature dimensions (i.e. numbers of bands), so heterogeneous transfer learning is required. An end-to-end heterogeneous transfer learning algorithm, namely the domain-invariant attention network (DIAN), is proposed to solve the cross-scene classification problem. DIAN contains two main modules. (1) A feature-alignment CNN (FACNN) extracts features from the source and target scenes, respectively, projecting the heterogeneous features from the two scenes into a shared low-dimensional subspace. (2) A domain-invariant attention block enforces cross-domain consistency through a specially designed class-specific domain-invariance loss, further eliminating the domain shift. Experiments on two different cross-scene HSI datasets show that the proposed DIAN achieves satisfactory classification results.
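A minimal sketch of the feature-alignment idea follows, assuming a per-scene projection from each sensor's band count into a shared low-dimensional subspace, followed by a classifier shared across scenes. The band counts, layer widths, and class names are illustrative, not the actual FACNN.

```python
# Minimal sketch: scene-specific projections map heterogeneous band counts
# into one shared subspace so a single classifier can serve both scenes.
import torch
import torch.nn as nn

class SceneProjector(nn.Module):
    def __init__(self, n_bands: int, shared_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bands, 128), nn.ReLU(),
                                 nn.Linear(128, shared_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, n_bands) spectra
        return self.net(x)

shared_dim, n_classes = 64, 7                              # illustrative sizes
source_proj = SceneProjector(n_bands=198, shared_dim=shared_dim)   # hypothetical source sensor
target_proj = SceneProjector(n_bands=128, shared_dim=shared_dim)   # hypothetical target sensor
classifier = nn.Linear(shared_dim, n_classes)                      # shared across scenes

src = torch.randn(16, 198)
tgt = torch.randn(16, 128)
print(classifier(source_proj(src)).shape, classifier(target_proj(tgt)).shape)
```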
{"title":"Domain-invariant attention network for transfer learning between cross-scene hyperspectral images","authors":"Minchao Ye, Chenglong Wang, Zhihao Meng, Fengchao Xiong, Yuntao Qian","doi":"10.1049/cvi2.12238","DOIUrl":"https://doi.org/10.1049/cvi2.12238","url":null,"abstract":"<p>Small-sample-size problem is always a challenge for hyperspectral image (HSI) classification. Considering the co-occurrence of land-cover classes between similar scenes, transfer learning can be performed, and cross-scene classification is deemed a feasible approach proposed in recent years. In cross-scene classification, the source scene which possesses sufficient labelled samples is used for assisting the classification of the target scene that has a few labelled samples. In most situations, different HSI scenes are imaged by different sensors resulting in their various input feature dimensions (i.e. number of bands), hence heterogeneous transfer learning is desired. An end-to-end heterogeneous transfer learning algorithm namely domain-invariant attention network (DIAN) is proposed to solve the cross-scene classification problem. The DIAN mainly contains two modules. (1) A feature-alignment CNN (FACNN) is applied to extract features from source and target scenes, respectively, aiming at projecting the heterogeneous features from two scenes into a shared low-dimensional subspace. (2) A domain-invariant attention block is developed to gain cross-domain consistency with a specially designed class-specific domain-invariance loss, thus further eliminating the domain shift. The experiments on two different cross-scene HSI datasets show that the proposed DIAN achieves satisfying classification results.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 7","pages":"739-749"},"PeriodicalIF":1.7,"publicationDate":"2023-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12238","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50151176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compressive sensing provides a promising sampling paradigm for video acquisition in resource-limited sensor applications. However, reconstructing the original video signal from sub-sampled measurements remains a great challenge. To exploit temporal redundancies within videos during recovery, previous works tend to perform alignment on initial reconstructions, which are too coarse to provide accurate motion estimates. To solve this problem, the authors propose a novel reconstruction network, named TSRN, for compressive video sensing. Specifically, a number of stacked temporal shift reconstruction blocks (TSRBs) progressively enhance the initial reconstruction. Each TSRB learns temporal structure by exchanging information with the previous and next time steps, and, owing to the efficiency of temporal shift operations, imposes no additional computation on the network compared with regular 2D convolutions. After this enhancement, a bidirectional alignment module builds accurate temporal dependencies directly with the help of optical flow. Unlike previous methods that only extract supplementary information from key frames, the proposed alignment module receives temporal information from the whole video sequence via bidirectional propagation, yielding better performance. Experimental results verify the superiority of the proposed method over other state-of-the-art approaches both quantitatively and qualitatively.
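The temporal shift operation at the heart of a TSRB can be sketched as below, in the spirit of temporal shift modules: part of the channels is shifted one step forward in time and part one step backward, so a subsequent 2D convolution can mix neighbouring frames at no extra cost. The shift fraction `fold_div` is an illustrative assumption.

```python
# Minimal sketch of a channel-wise temporal shift over a clip tensor.
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """x: (batch, time, channels, height, width)."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                      # these channels move forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]      # these channels move backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]                 # the rest stay in place
    return out

clip = torch.randn(2, 8, 64, 32, 32)                          # dummy video feature clip
print(temporal_shift(clip).shape)                             # torch.Size([2, 8, 64, 32, 32])
```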
{"title":"A temporal shift reconstruction network for compressive video sensing","authors":"Zhenfei Gu, Chao Zhou, Guofeng Lin","doi":"10.1049/cvi2.12234","DOIUrl":"10.1049/cvi2.12234","url":null,"abstract":"<p>Compressive sensing provides a promising sampling paradigm for video acquisition for resource-limited sensor applications. However, the reconstruction of original video signals from sub-sampled measurements is still a great challenge. To exploit the temporal redundancies within videos during the recovery, previous works tend to perform alignment on initial reconstructions, which are too coarse to provide accurate motion estimations. To solve this problem, the authors propose a novel reconstruction network, named TSRN, for compressive video sensing. Specifically, the authors utilise a number of stacked temporal shift reconstruction blocks (TSRBs) to enhance the initial reconstruction progressively. Each TSRB could learn the temporal structures by exchanging information with last and next time step, and no additional computations is imposed on the network compared to regular 2D convolutions due to the high efficiency of temporal shift operations. After the enhancement, a bidirectional alignment module to build accurate temporal dependencies directly with the help of optical flows is employed. Different from previous methods that only extract supplementary information from the key frames, the proposed alignment module can receive temporal information from the whole video sequence via bidirectional propagations, thus yielding better performance. Experimental results verify the superiority of the proposed method over other state-of-the-art approaches quantitatively and qualitatively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"448-457"},"PeriodicalIF":1.7,"publicationDate":"2023-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12234","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136107495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siamese-based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Currently, many researchers use a hybrid structure of convolutional neural networks and Transformers to design the backbone network of trackers, aiming to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer-based tracker that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN comprises two phases, feature extraction and feature fusion, both built with Transformers. Compared with the hybrid “CNNs + Transformer” structure, the proposed method enhances the continuity of feature extraction and the information interaction between features, enabling more comprehensive feature extraction. Moreover, to exploit the temporal dimension, the authors propose a TFFN for updating the template image. The network uses the Transformer to fuse the tracking results of multiple frames with the initial frame, allowing the template image to continuously incorporate more information and maintain accurate target features. Extensive experiments show that the STFT tracker achieves state-of-the-art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT-10K, and UAV123). In particular, STFT achieves remarkable area-under-the-curve scores of 0.652 and 0.706 on the LaSOT and OTB100 benchmarks, respectively.
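As a rough sketch of the template-update idea in the TFFN, the PyTorch snippet below lets initial-template tokens attend to tokens pooled from recent tracking results via standard multi-head attention. The token dimensions and the class name `TemplateUpdater` are assumptions, not the paper's exact design.

```python
# Minimal sketch: the first-frame template queries features from recent
# tracking results so the running template keeps accumulating target information.
import torch
import torch.nn as nn

class TemplateUpdater(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, initial: torch.Tensor, recent: torch.Tensor) -> torch.Tensor:
        # initial: (B, N, dim) tokens of the first-frame template
        # recent:  (B, M, dim) tokens pooled from recent tracking results
        fused, _ = self.attn(query=initial, key=recent, value=recent)
        return self.norm(initial + fused)            # updated template tokens

init_tokens = torch.randn(1, 64, 256)
recent_tokens = torch.randn(1, 192, 256)
print(TemplateUpdater()(init_tokens, recent_tokens).shape)   # torch.Size([1, 64, 256])
```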
{"title":"STFT: Spatial and temporal feature fusion for transformer tracker","authors":"Hao Zhang, Yan Piao, Nan Qi","doi":"10.1049/cvi2.12233","DOIUrl":"10.1049/cvi2.12233","url":null,"abstract":"<p>Siamese-based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Currently, many researchers use a hybrid structure of convolutional neural networks and Transformers to design the backbone network of trackers, aiming to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer-based tracker that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN includes two phases: feature extraction and feature fusion, and both phases are constructed with a Transformer. Compared with the hybrid structure of “CNNs + Transformer,” the proposed method enhances the continuity of feature extraction and the ability of information interaction between features, enabling comprehensive feature extraction. Moreover, to consider the temporal dimension, the authors propose a TFFN for updating the template image. The network utilises the Transformer to fuse the tracking results of multiple frames with the initial frame, allowing the template image to continuously incorporate more information and maintain the accuracy of target features. Extensive experiments show that the tracker STFT achieves state-of-the-art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT-10K, and UAV123). Especially, the tracker STFT achieves remarkable area under the curve score of 0.652 and 0.706 on the LaSOT and OTB100 benchmark respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"165-176"},"PeriodicalIF":1.7,"publicationDate":"2023-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12233","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42381518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple events in a long untrimmed video exhibit similarity and continuity. These characteristics can be regarded as a kind of topic-level semantic information, which may manifest as the same sport, similar scenes, the same objects, and so on. Inspired by this, a novel latent topic-aware network (LTNet) is proposed in this article. LTNet explores potential themes within videos and generates more continuous captions. Firstly, a global visual topic finder is employed to detect the similarity among events and obtain latent topic-level features. Secondly, a latent topic-oriented relation learner is designed to further enhance the topic-level representations by capturing the relationship between each event and the video themes. Benefiting from the finder and the learner, the caption generator predicts more accurate and coherent descriptions. The effectiveness of the proposed method is demonstrated on the ActivityNet Captions and YouCook2 datasets, where LTNet achieves relative improvements in CIDEr score of over 3.03% and 0.50%, respectively.
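One way to picture the global visual topic finder is as a similarity-weighted pooling over event features, sketched below. The softmax temperature and function name are illustrative assumptions rather than the paper's actual formulation.

```python
# Minimal sketch: events that are mutually similar share information through a
# softmax-normalised cosine-similarity matrix, yielding topic-aware features.
import torch
import torch.nn.functional as F

def topic_level_features(event_feats: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """event_feats: (num_events, dim) -> topic-aware features of the same shape."""
    normed = F.normalize(event_feats, dim=-1)
    sim = normed @ normed.t() / temperature       # pairwise cosine similarity
    weights = sim.softmax(dim=-1)                 # how much each event borrows from the others
    return weights @ event_feats                  # similarity-weighted aggregation

events = torch.randn(6, 512)                      # dummy per-event features
print(topic_level_features(events).shape)         # torch.Size([6, 512])
```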
{"title":"A latent topic-aware network for dense video captioning","authors":"Tao Xu, Yuanyuan Cui, Xinyu He, Caihua Liu","doi":"10.1049/cvi2.12195","DOIUrl":"10.1049/cvi2.12195","url":null,"abstract":"<p>Multiple events in a long untrimmed video possess the characteristics of similarity and continuity. These characteristics can be considered as a kind of topic semantic information, which probably behaves as same sports, similar scenes, same objects etc. Inspired by this, a novel latent topic-aware network (LTNet) is proposed in this article. The LTNet explores potential themes within videos and generates more continuous captions. Firstly, a global visual topic finder is employed to detect the similarity among events and obtain latent topic-level features. Secondly, a latent topic-oriented relation learner is designed to further enhance the topic-level representations by capturing the relationship between each event and the video themes. Benefiting from the finder and the learner, the caption generator is capable of predicting more accurate and coherent descriptions. The effectiveness of our proposed method is demonstrated on ActivityNet Captions and YouCook2 datasets, where LTNet shows a relative performance of over 3.03% and 0.50% in CIDEr score respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 7","pages":"795-803"},"PeriodicalIF":1.7,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12195","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49048324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A novel video summarisation method called the Hyperbolic Graph Convolutional Network (HVSN) is proposed, which addresses the challenges of summarising edited videos and capturing the semantic consistency of video shots at different time points. Unlike existing methods that use linear video sequences as input, HVSN leverages Hyperbolic Graph Convolutional Networks (HGCNs) and an adaptive graph convolutional adjacency matrix network to learn and aggregate features from video shots. Moreover, a feature fusion mechanism based on the attention mechanism is employed to facilitate cross-module feature fusion. To evaluate the performance of the proposed method, experiments are conducted on two benchmark datasets, TVSum and SumMe. The results demonstrate that HVSN achieves state-of-the-art performance, with F1-scores of 62.04% and 50.26% on TVSum and SumMe, respectively. The use of HGCNs enables the model to better capture the complex spatial structures of video shots, and thus contributes to the improved performance of video summarisation.
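The adaptive graph convolutional adjacency matrix can be sketched as a learnable, row-normalised adjacency over shot features, as below. This sketch stays in Euclidean space for brevity and does not reproduce the hyperbolic operations of HGCNs, so it is only a loose analogue of the described model.

```python
# Minimal sketch: the adjacency over video shots is a learnable parameter
# (softmax-normalised per row) used by a plain graph-convolution update.
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    def __init__(self, num_shots: int, dim: int):
        super().__init__()
        self.adj_logits = nn.Parameter(torch.zeros(num_shots, num_shots))
        self.proj = nn.Linear(dim, dim)

    def forward(self, shot_feats: torch.Tensor) -> torch.Tensor:
        adj = self.adj_logits.softmax(dim=-1)     # learned, row-normalised adjacency
        return torch.relu(adj @ self.proj(shot_feats))

shots = torch.randn(40, 256)                      # dummy per-shot features
print(AdaptiveGraphConv(40, 256)(shots).shape)    # torch.Size([40, 256])
```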
{"title":"Feature fusion over hyperbolic graph convolution networks for video summarisation","authors":"GuangLi Wu, ShengTao Wang, ShiPeng Xu","doi":"10.1049/cvi2.12232","DOIUrl":"10.1049/cvi2.12232","url":null,"abstract":"<p>A novel video summarisation method called the Hyperbolic Graph Convolutional Network (HVSN) is proposed, which addresses the challenges of summarising edited videos and capturing the semantic consistency of video shots at different time points. Unlike existing methods that use linear video sequences as input, HVSN leverages Hyperbolic Graph Convolutional Networks (HGCNs) and an adaptive graph convolutional adjacency matrix network to learn and aggregate features from video shots. Moreover, a feature fusion mechanism based on the attention mechanism is employed to facilitate cross-module feature fusion. To evaluate the performance of the proposed method, experiments are conducted on two benchmark datasets, TVSum and SumMe. The results demonstrate that HVSN achieves state-of-the-art performance, with F1-scores of 62.04% and 50.26% on TVSum and SumMe, respectively. The use of HGCNs enables the model to better capture the complex spatial structures of video shots, and thus contributes to the improved performance of video summarisation.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"150-164"},"PeriodicalIF":1.7,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12232","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46730355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}