
Latest publications in IET Computer Vision

Cascading AB-YOLOv5 and PB-YOLOv5 for rib fracture detection in frontal and oblique chest X-ray images
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-28 | DOI: 10.1049/cvi2.12239
Hsin-Chun Tsai, Nan-Han Lu, Kuo-Ying Liu, Chuan-Han Lin, Jhing-Fa Wang

Convolutional deep learning models have shown comparable performance to radiologists in detecting and classifying thoracic diseases. However, research on rib fractures remains limited compared to other thoracic abnormalities. Moreover, existing deep learning models primarily focus on using frontal chest X-ray (CXR) images. To address these gaps, the authors utilised the EDARib-CXR dataset, comprising 369 frontal and 829 oblique CXRs. These X-rays were annotated by experienced radiologists, specifically identifying the presence of rib fractures using bounding-box-level annotations. The authors introduce two detection models, AB-YOLOv5 and PB-YOLOv5, and train and evaluate them on the EDARib-CXR dataset. AB-YOLOv5 is a modified YOLOv5 network that incorporates an auxiliary branch to enhance the resolution of feature maps in the final convolutional network layer. On the other hand, PB-YOLOv5 maintains the same structure as the original YOLOv5 but employs image patches during training to preserve features of small objects in downsampled images. Furthermore, the authors propose a novel two-level cascaded architecture that integrates both AB-YOLOv5 and PB-YOLOv5 detection models. This structure demonstrates improved metrics on the test set, achieving an AP30 score of 0.785. Consequently, the study successfully develops deep learning-based detectors capable of identifying and localising fractured ribs in both frontal and oblique CXR images.
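The transferable idea in PB-YOLOv5 is training and inference on image patches so that small fracture regions are not lost to downsampling. Below is a minimal, illustrative Python sketch of that patch-tiling step; the patch size, overlap, and `detector` interface are assumptions for illustration, not the authors' implementation, and the cascade with AB-YOLOv5 and the final non-maximum suppression are omitted.

```python
# Illustrative sketch (not the authors' code): tile a chest X-ray into
# overlapping patches so small rib-fracture regions survive downsampling,
# then map patch-level detections back to full-image coordinates.
from typing import Callable, List, Tuple

import numpy as np


def tile_image(image: np.ndarray, patch: int = 640, overlap: int = 128):
    """Yield (patch_array, x_offset, y_offset) tiles covering the image."""
    h, w = image.shape[:2]
    step = patch - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            y2, x2 = min(y + patch, h), min(x + patch, w)
            yield image[y:y2, x:x2], x, y


def detect_on_patches(image: np.ndarray,
                      detector: Callable[[np.ndarray], List[Tuple[float, float, float, float, float]]]
                      ) -> List[Tuple[float, float, float, float, float]]:
    """Run a patch-level detector and shift its (x1, y1, x2, y2, score) boxes
    into full-image coordinates; duplicates in the overlap regions would still
    need non-maximum suppression, which is omitted here for brevity."""
    boxes = []
    for patch_img, ox, oy in tile_image(image):
        for x1, y1, x2, y2, score in detector(patch_img):
            boxes.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy, score))
    return boxes
```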

Citations: 0
IDBNet: Improved differentiable binarisation network for natural scene text detection
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-28 | DOI: 10.1049/cvi2.12241
Zhijia Zhang, Yiming Shao, Ligang Wang, Haixing Li, Yunpeng Liu

Text in natural scenes can express rich semantic information, which helps people understand and analyse everyday things. This paper focuses on the problems of discrete spatial distribution and variable geometric size of text in natural scenes with complex backgrounds and proposes an end-to-end natural scene text detection method based on DBNet. The authors first use IResNet as the backbone network, which retains more text features without increasing network parameters. Furthermore, a Transformer-based module is introduced in the feature extraction stage to strengthen the correlation between high-level feature pixels. Then, the authors add a spatial pyramid pooling structure at the end of feature extraction, which combines local and global features, enriches the expressive ability of the feature maps, and alleviates the detection limitations caused by the geometric size of features. Finally, to better integrate the features of each level, a dual attention module is embedded after multi-scale feature fusion. Extensive experiments are conducted on the MSRA-TD500, CTW1500, ICDAR2015, and MLT2017 datasets. The results show that IDBNet improves the average precision, recall, and F-measure of text detection compared with state-of-the-art methods and has higher predictive ability and practicability.
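A spatial pyramid pooling block of the kind appended after feature extraction can be sketched in a few lines of PyTorch. The kernel sizes and channel counts below are common defaults, not IDBNet's actual configuration; the Transformer module and dual attention block are not shown.

```python
# Minimal sketch of a spatial pyramid pooling block: the same feature map is
# max-pooled at several receptive-field sizes and fused back together.
import torch
import torch.nn as nn


class SPPBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        # 1x1 conv fuses the original map with its pooled variants, mixing
        # local detail with progressively larger receptive fields.
        self.fuse = nn.Conv2d(in_ch * (len(kernels) + 1), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pooled = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(pooled, dim=1))


feat = torch.randn(1, 256, 40, 40)        # a backbone feature map
print(SPPBlock(256, 256)(feat).shape)     # torch.Size([1, 256, 40, 40])
```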

Citations: 0
Real-time vehicle detection using segmentation-based detection network and trajectory prediction
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-25 | DOI: 10.1049/cvi2.12236
Nafiseh Zarei, Payman Moallem, Mohammadreza Shams

The position of vehicles is determined using an algorithm that includes two stages: detection and prediction. The more frames on which the detection network is run, the more accurate the result; the more frames handled by the prediction network, the faster the algorithm. The algorithm is therefore very flexible in trading off the required accuracy and speed. YOLO's base detection network is designed to be robust against vehicle scale changes. Feature maps produced in the detector network also contribute greatly to increasing the accuracy of the detector. In these maps, using differential images and a U-Net-based module, the image is segmented into two classes: vehicle and background. To increase the accuracy of the recursive prediction network, vehicle manoeuvres are classified. For this purpose, the spatial and temporal information of the vehicles is considered simultaneously. This classifier is much more effective than classifiers that consider spatial and temporal information separately. Experiments on the Highway and UA-DETRAC datasets demonstrate the performance of the proposed algorithm in urban traffic monitoring systems.
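The accuracy/speed trade-off described above comes from scheduling: a full detector runs on every N-th frame and a lightweight trajectory predictor fills the gaps. The sketch below illustrates that scheduling with a naive constant-velocity extrapolation; the `detector` interface, the interval, and the predictor itself are placeholders rather than the paper's networks.

```python
# Illustrative sketch: run an expensive detector every `detect_every` frames
# and extrapolate boxes with a constant-velocity model in between.
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def track(frames: List[object],
          detector: Callable[[object], List[Box]],
          detect_every: int = 3) -> List[List[Box]]:
    results: List[List[Box]] = []
    boxes: List[Box] = []
    last_det: List[Box] = []
    velocity: List[Tuple[float, float]] = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0 or not boxes:
            boxes = detector(frame)                          # accurate but slower
            if len(boxes) == len(last_det):
                # per-frame velocity from the two most recent detections
                # (naive index-based association; no re-identification here)
                velocity = [((b[0] - p[0]) / detect_every, (b[1] - p[1]) / detect_every)
                            for b, p in zip(boxes, last_det)]
            else:
                velocity = [(0.0, 0.0)] * len(boxes)
            last_det = boxes
        else:
            # fast path: shift each box by its estimated per-frame velocity
            boxes = [(b[0] + vx, b[1] + vy, b[2] + vx, b[3] + vy)
                     for b, (vx, vy) in zip(boxes, velocity)]
        results.append(boxes)
    return results
```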

Citations: 0
Continuous sign language recognition based on hierarchical memory sequence network
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-22 | DOI: 10.1049/cvi2.12240
Cuihong Xue, Jingli Jia, Ming Yu, Gang Yan, Yingchun Guo, Yuehao Liu

To address the problems that feature extractors lack strongly supervised training and that single-sequence models capture insufficient temporal information, a hierarchical sequence memory network with a multi-level iterative optimisation strategy is proposed for continuous sign language recognition. This method uses the spatial-temporal fusion convolution network (STFC-Net) to extract the spatial-temporal information of RGB and optical-flow video frames and obtain the multi-modal visual features of a sign language video. Then, to enhance the temporal relationships of the visual feature maps, the hierarchical memory sequence network is used to capture local utterance features and global context dependencies across the time dimension and obtain sequence features. Finally, the decoder decodes the final sentence sequence. To strengthen the feature extractor, the authors adopt a multi-level iterative optimisation strategy to fine-tune STFC-Net and the utterance feature extractor. Experimental results on the RWTH-PHOENIX-Weather 2014 multi-signer dataset and the Chinese sign language dataset show the effectiveness and superiority of this method.
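As a rough illustration of the pipeline, the sketch below fuses per-frame RGB and optical-flow features and models the sequence with a bidirectional recurrent layer. The feature dimensions and the plain BiLSTM are assumptions standing in for STFC-Net and the hierarchical memory sequence network, which the paper defines differently.

```python
# Illustrative two-stream sequence model: fuse RGB and optical-flow features
# per frame, add temporal context, and emit per-frame gloss logits suitable
# for a CTC-style training objective.
import torch
import torch.nn as nn


class TwoStreamSequenceModel(nn.Module):
    def __init__(self, rgb_dim=512, flow_dim=512, hidden=256, vocab=1000):
        super().__init__()
        self.fuse = nn.Linear(rgb_dim + flow_dim, hidden)      # modality fusion
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True,
                                bidirectional=True)             # sequence context
        self.classify = nn.Linear(2 * hidden, vocab)            # per-frame glosses

    def forward(self, rgb_feats: torch.Tensor, flow_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats, flow_feats: (batch, time, feature_dim) from two backbones
        fused = torch.relu(self.fuse(torch.cat([rgb_feats, flow_feats], dim=-1)))
        context, _ = self.temporal(fused)
        return self.classify(context)    # (batch, time, vocab)


model = TwoStreamSequenceModel()
out = model(torch.randn(2, 60, 512), torch.randn(2, 60, 512))
print(out.shape)  # torch.Size([2, 60, 1000])
```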

Citations: 0
Scene flow estimation from 3D point clouds based on dual-branch implicit neural representations
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-15 | DOI: 10.1049/cvi2.12237
Mingliang Zhai, Kang Ni, Jiucheng Xie, Hao Gao

Recently, online optimisation-based scene flow estimation has attracted significant attention due to its strong domain adaptivity. Although online optimisation-based methods have made significant advances, their performance is far from satisfactory because only flow priors are considered, neglecting scene priors that are crucial for representing dynamic scenes. To address this problem, the authors introduce a dual-branch MLP-based architecture that encodes implicit scene representations from a source 3D point cloud and can additionally synthesise a target 3D point cloud. The mapping function between the source and synthesised target 3D point clouds is thus established as an extra implicit regulariser to capture scene priors. Moreover, the model infers both flow and scene priors in a stronger bidirectional manner, effectively establishing spatiotemporal constraints among the synthesised, source, and target 3D point clouds. Experiments on four challenging datasets, including KITTI scene flow, FlyingThings3D, Argoverse, and nuScenes, show that the method achieves competitive results, demonstrating its effectiveness and generality.
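Online optimisation-based scene flow can be illustrated by fitting a small coordinate MLP per scene so that flowed source points match the target cloud under a Chamfer distance. The sketch below covers only this flow-prior branch with toy data; the paper's second, scene-prior branch that synthesises the target cloud is not shown.

```python
# Illustrative per-scene ("online") optimisation of a flow MLP: predict a 3D
# flow vector for each source point and minimise the Chamfer distance between
# the warped source cloud and the target cloud.
import torch
import torch.nn as nn


def chamfer(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (N, 3), b: (M, 3); symmetric nearest-neighbour distance
    d = torch.cdist(a, b)                               # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


flow_mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, 3))

source = torch.randn(2048, 3)                           # source-frame point cloud
target = source + torch.tensor([0.5, 0.0, 0.0])         # toy rigidly shifted target

opt = torch.optim.Adam(flow_mlp.parameters(), lr=1e-3)
for step in range(200):                                 # per-scene fitting loop
    opt.zero_grad()
    warped = source + flow_mlp(source)                  # apply the predicted flow
    loss = chamfer(warped, target)
    loss.backward()
    opt.step()
```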

Citations: 0
Domain-invariant attention network for transfer learning between cross-scene hyperspectral images
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-15 | DOI: 10.1049/cvi2.12238
Minchao Ye, Chenglong Wang, Zhihao Meng, Fengchao Xiong, Yuntao Qian

The small-sample-size problem is a persistent challenge for hyperspectral image (HSI) classification. Considering the co-occurrence of land-cover classes between similar scenes, transfer learning can be performed, and cross-scene classification has emerged as a feasible approach in recent years. In cross-scene classification, a source scene with sufficient labelled samples is used to assist the classification of a target scene that has only a few labelled samples. In most situations, different HSI scenes are imaged by different sensors, resulting in different input feature dimensions (i.e. numbers of bands); hence heterogeneous transfer learning is required. An end-to-end heterogeneous transfer learning algorithm, namely the domain-invariant attention network (DIAN), is proposed to solve the cross-scene classification problem. The DIAN mainly contains two modules. (1) A feature-alignment CNN (FACNN) is applied to extract features from the source and target scenes, respectively, aiming to project the heterogeneous features from the two scenes into a shared low-dimensional subspace. (2) A domain-invariant attention block is developed to gain cross-domain consistency with a specially designed class-specific domain-invariance loss, thus further eliminating the domain shift. Experiments on two cross-scene HSI datasets show that the proposed DIAN achieves satisfying classification results.
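The heterogeneous-input idea can be illustrated by scene-specific projection layers that map different band counts into one shared subspace feeding a shared classifier. The band counts, layer sizes, and plain linear projections below are assumptions; DIAN's FACNN and domain-invariant attention block are more elaborate.

```python
# Illustrative sketch: source and target hyperspectral pixels have different
# numbers of bands, so each domain gets its own projection into a shared
# low-dimensional subspace, followed by one shared classifier.
import torch
import torch.nn as nn


class SharedSubspaceClassifier(nn.Module):
    def __init__(self, src_bands: int, tgt_bands: int, shared_dim=64, classes=9):
        super().__init__()
        self.project_src = nn.Linear(src_bands, shared_dim)   # source-specific head
        self.project_tgt = nn.Linear(tgt_bands, shared_dim)   # target-specific head
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(shared_dim, classes))

    def forward(self, x: torch.Tensor, domain: str) -> torch.Tensor:
        proj = self.project_src if domain == "source" else self.project_tgt
        return self.classifier(proj(x))


model = SharedSubspaceClassifier(src_bands=224, tgt_bands=102)
src_logits = model(torch.randn(8, 224), "source")   # many labelled pixels
tgt_logits = model(torch.randn(8, 102), "target")   # few labelled pixels
print(src_logits.shape, tgt_logits.shape)
```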

Citations: 0
A temporal shift reconstruction network for compressive video sensing
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-09-09 | DOI: 10.1049/cvi2.12234
Zhenfei Gu, Chao Zhou, Guofeng Lin

Compressive sensing provides a promising sampling paradigm for video acquisition in resource-limited sensor applications. However, reconstructing the original video signal from sub-sampled measurements remains a great challenge. To exploit the temporal redundancies within videos during recovery, previous works tend to perform alignment on initial reconstructions, which are too coarse to provide accurate motion estimates. To solve this problem, the authors propose a novel reconstruction network, named TSRN, for compressive video sensing. Specifically, the authors utilise a number of stacked temporal shift reconstruction blocks (TSRBs) to enhance the initial reconstruction progressively. Each TSRB learns temporal structure by exchanging information with the previous and next time steps and, owing to the high efficiency of temporal shift operations, imposes no additional computation on the network compared to regular 2D convolutions. After the enhancement, a bidirectional alignment module is employed to build accurate temporal dependencies directly with the help of optical flow. Different from previous methods that only extract supplementary information from key frames, the proposed alignment module receives temporal information from the whole video sequence via bidirectional propagation, thus yielding better performance. Experimental results verify the superiority of the proposed method over other state-of-the-art approaches quantitatively and qualitatively.
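The temporal shift operation that the TSRBs build on can be written in a few lines: part of the channels is shifted forward in time and part backward, so a following 2D convolution sees neighbouring frames at essentially no extra cost. The 1/8 channel split below follows the common Temporal Shift Module recipe and is not necessarily TSRN's exact setting.

```python
# Illustrative temporal shift: one channel slice moves forward in time, one
# moves backward, the rest stays put; zeros pad the boundary frames.
import torch


def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    # x: (batch, time, channels, height, width)
    fold = x.size(2) // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # leave the rest untouched
    return out


frames = torch.randn(1, 8, 64, 32, 32)
print(temporal_shift(frames).shape)  # torch.Size([1, 8, 64, 32, 32])
```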

Citations: 0
STFT: Spatial and temporal feature fusion for transformer tracker
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-31 | DOI: 10.1049/cvi2.12233
Hao Zhang, Yan Piao, Nan Qi

Siamese-based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Currently, many researchers use a hybrid structure of convolutional neural networks and Transformers to design the backbone network of trackers, aiming to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer-based tracker that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN includes two phases, feature extraction and feature fusion, and both phases are built with Transformers. Compared with the hybrid "CNNs + Transformer" structure, the proposed method enhances the continuity of feature extraction and the ability of features to exchange information, enabling comprehensive feature extraction. Moreover, to incorporate the temporal dimension, the authors propose a TFFN for updating the template image. The network uses the Transformer to fuse the tracking results of multiple frames with the initial frame, allowing the template image to continuously incorporate more information and maintain the accuracy of target features. Extensive experiments show that the STFT tracker achieves state-of-the-art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT-10K, and UAV123). In particular, STFT achieves remarkable area-under-the-curve scores of 0.652 and 0.706 on the LaSOT and OTB100 benchmarks, respectively.
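The core mechanism in Transformer trackers of this kind is cross-attention between template and search-region tokens. The sketch below shows that single step with illustrative token counts and dimensions; STFT's MSFFN and TFFN stack several such stages and additionally fuse the tracking results of past frames into the template.

```python
# Illustrative cross-attention step: search-region tokens (queries) attend to
# template tokens (keys/values), injecting template appearance into the
# search features before the prediction head.
import torch
import torch.nn as nn

dim, heads = 256, 8
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

template_tokens = torch.randn(1, 64, dim)    # e.g. an 8x8 template feature map, flattened
search_tokens = torch.randn(1, 400, dim)     # e.g. a 20x20 search feature map, flattened

fused, _ = attn(query=search_tokens, key=template_tokens, value=template_tokens)
print(fused.shape)  # torch.Size([1, 400, 256])
```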

Citations: 0
A latent topic-aware network for dense video captioning
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.1049/cvi2.12195
Tao Xu, Yuanyuan Cui, Xinyu He, Caihua Liu

Multiple events in a long untrimmed video tend to be similar and continuous. These characteristics can be regarded as a kind of topic-level semantic information, which may manifest as the same sport, similar scenes, the same objects, and so on. Inspired by this, a novel latent topic-aware network (LTNet) is proposed in this article. The LTNet explores potential themes within videos and generates more continuous captions. Firstly, a global visual topic finder is employed to detect the similarity among events and obtain latent topic-level features. Secondly, a latent topic-oriented relation learner is designed to further enhance the topic-level representations by capturing the relationship between each event and the video themes. Benefiting from the finder and the learner, the caption generator is capable of predicting more accurate and coherent descriptions. The effectiveness of the proposed method is demonstrated on the ActivityNet Captions and YouCook2 datasets, where LTNet achieves relative improvements of over 3.03% and 0.50% in CIDEr score, respectively.
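One way to picture a global visual topic finder is to let each event aggregate the others in proportion to feature similarity, so related events share a topic-level representation. The cosine-similarity/softmax formulation below is an assumption for illustration, not LTNet's exact design.

```python
# Illustrative topic aggregation: compare pooled event features pairwise and
# let each event absorb the others weighted by similarity.
import torch
import torch.nn.functional as F


def topic_features(event_feats: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # event_feats: (num_events, dim) pooled features, one per event proposal
    normed = F.normalize(event_feats, dim=-1)
    sim = normed @ normed.t() / temperature       # (num_events, num_events) similarities
    weights = sim.softmax(dim=-1)                 # soft grouping of similar events
    return weights @ event_feats                  # similarity-weighted topic vectors


events = torch.randn(12, 512)
print(topic_features(events).shape)  # torch.Size([12, 512])
```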

Citations: 0
Feature fusion over hyperbolic graph convolution networks for video summarisation
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-25 | DOI: 10.1049/cvi2.12232
GuangLi Wu, ShengTao Wang, ShiPeng Xu

A novel video summarisation method called the Hyperbolic Graph Convolutional Network (HVSN) is proposed, which addresses the challenges of summarising edited videos and capturing the semantic consistency of video shots at different time points. Unlike existing methods that use linear video sequences as input, HVSN leverages Hyperbolic Graph Convolutional Networks (HGCNs) and an adaptive graph convolutional adjacency matrix network to learn and aggregate features from video shots. Moreover, a feature fusion mechanism based on the attention mechanism is employed to facilitate cross-module feature fusion. To evaluate the performance of the proposed method, experiments are conducted on two benchmark datasets, TVSum and SumMe. The results demonstrate that HVSN achieves state-of-the-art performance, with F1-scores of 62.04% and 50.26% on TVSum and SumMe, respectively. The use of HGCNs enables the model to better capture the complex spatial structures of video shots, and thus contributes to the improved performance of video summarisation.
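Hyperbolic graph convolution layers commonly work through the tangent space at the origin of the Poincaré ball: node features are mapped there with the logarithmic map, aggregated as in an ordinary GCN, and mapped back with the exponential map. The sketch below shows that general pattern with unit curvature; HVSN's adaptive adjacency learning, attention-based fusion, and exact layer design are not reproduced.

```python
# Illustrative hyperbolic GCN layer via the tangent space at the origin of
# the Poincaré ball (curvature c): log map -> GCN-style aggregation -> exp map.
import torch

EPS = 1e-6


def expmap0(v: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    norm = v.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)


def logmap0(x: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    norm = x.norm(dim=-1, keepdim=True).clamp_min(EPS)
    return torch.atanh((c ** 0.5 * norm).clamp(max=1 - 1e-5)) * x / (c ** 0.5 * norm)


def hyperbolic_gcn_layer(x_hyp: torch.Tensor, adj: torch.Tensor,
                         weight: torch.Tensor, c: float = 1.0) -> torch.Tensor:
    x_tan = logmap0(x_hyp, c)              # Poincaré ball -> tangent space
    agg = adj @ (x_tan @ weight)           # standard GCN-style aggregation
    return expmap0(agg, c)                 # back onto the ball


num_shots, dim = 20, 64
adj = torch.softmax(torch.randn(num_shots, num_shots), dim=-1)   # row-normalised shot graph
x = expmap0(torch.randn(num_shots, dim) * 0.1)
out = hyperbolic_gcn_layer(x, adj, torch.randn(dim, dim) * 0.1)
print(out.shape)  # torch.Size([20, 64])
```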

Citations: 0