
Latest Publications in IET Computer Vision

STFT: Spatial and temporal feature fusion for transformer tracker
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-31 | DOI: 10.1049/cvi2.12233
Hao Zhang, Yan Piao, Nan Qi

Siamese-based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Currently, many researchers use a hybrid structure of convolutional neural networks and Transformers to design the backbone network of trackers, aiming to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer-based tracker that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN includes two phases, feature extraction and feature fusion, both constructed with a Transformer. Compared with the hybrid "CNNs + Transformer" structure, the proposed method enhances the continuity of feature extraction and the information interaction between features, enabling comprehensive feature extraction. Moreover, to take the temporal dimension into account, the authors propose a TFFN for updating the template image. The network utilises the Transformer to fuse the tracking results of multiple frames with the initial frame, allowing the template image to continuously incorporate more information and maintain the accuracy of the target features. Extensive experiments show that the STFT tracker achieves state-of-the-art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT-10K, and UAV123). In particular, STFT achieves remarkable area-under-the-curve scores of 0.652 and 0.706 on the LaSOT and OTB100 benchmarks, respectively.
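The temporal-fusion idea described above (updating the template by letting a Transformer fuse the initial-frame template with features from recent tracking results) can be illustrated with a minimal PyTorch sketch. The module name, token shapes, and layer counts below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TemporalTemplateFusion(nn.Module):
    """Sketch of a TFFN-style template update: tokens of the initial
    template attend jointly with tokens gathered from recent tracking
    results, and the fused template positions become the new template."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, init_template, recent_results):
        # init_template:  (B, N_t, dim) tokens from the first-frame template
        # recent_results: (B, N_r, dim) tokens pooled from recent frame results
        tokens = torch.cat([init_template, recent_results], dim=1)
        fused = self.encoder(tokens)
        # keep only the template positions as the updated template
        return fused[:, : init_template.size(1), :]

if __name__ == "__main__":
    tffn = TemporalTemplateFusion()
    z0 = torch.randn(2, 64, 256)       # initial template tokens
    zt = torch.randn(2, 192, 256)      # tokens from three recent frames
    print(tffn(z0, zt).shape)          # torch.Size([2, 64, 256])
```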

{"title":"STFT: Spatial and temporal feature fusion for transformer tracker","authors":"Hao Zhang,&nbsp;Yan Piao,&nbsp;Nan Qi","doi":"10.1049/cvi2.12233","DOIUrl":"10.1049/cvi2.12233","url":null,"abstract":"<p>Siamese-based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Currently, many researchers use a hybrid structure of convolutional neural networks and Transformers to design the backbone network of trackers, aiming to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer-based tracker that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN includes two phases: feature extraction and feature fusion, and both phases are constructed with a Transformer. Compared with the hybrid structure of “CNNs + Transformer,” the proposed method enhances the continuity of feature extraction and the ability of information interaction between features, enabling comprehensive feature extraction. Moreover, to consider the temporal dimension, the authors propose a TFFN for updating the template image. The network utilises the Transformer to fuse the tracking results of multiple frames with the initial frame, allowing the template image to continuously incorporate more information and maintain the accuracy of target features. Extensive experiments show that the tracker STFT achieves state-of-the-art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT-10K, and UAV123). Especially, the tracker STFT achieves remarkable area under the curve score of 0.652 and 0.706 on the LaSOT and OTB100 benchmark respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"165-176"},"PeriodicalIF":1.7,"publicationDate":"2023-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12233","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42381518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A latent topic-aware network for dense video captioning
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-29 | DOI: 10.1049/cvi2.12195
Tao Xu, Yuanyuan Cui, Xinyu He, Caihua Liu

Multiple events in a long untrimmed video possess the characteristics of similarity and continuity. These characteristics can be regarded as a kind of topic-level semantic information, which may manifest as the same sport, similar scenes, the same objects, and so on. Inspired by this, a novel latent topic-aware network (LTNet) is proposed in this article. The LTNet explores potential themes within videos and generates more continuous captions. Firstly, a global visual topic finder is employed to detect the similarity among events and obtain latent topic-level features. Secondly, a latent topic-oriented relation learner is designed to further enhance the topic-level representations by capturing the relationship between each event and the video themes. Benefiting from the finder and the learner, the caption generator is capable of predicting more accurate and coherent descriptions. The effectiveness of the proposed method is demonstrated on the ActivityNet Captions and YouCook2 datasets, where LTNet shows relative improvements of over 3.03% and 0.50% in CIDEr score, respectively.
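A minimal PyTorch sketch of the two ideas, a global topic finder that pools event features into latent topic vectors and a topic-oriented relation learner that lets each event attend to those topics, is given below. Class names, dimensions, and the use of standard multi-head attention are assumptions for illustration rather than the LTNet specification.

```python
import torch
import torch.nn as nn

class LatentTopicFinder(nn.Module):
    """K learnable topic queries attend over all event features of one
    video, yielding latent topic-level features shared by similar events."""

    def __init__(self, dim=512, num_topics=4, heads=8):
        super().__init__()
        self.topic_queries = nn.Parameter(torch.randn(num_topics, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, event_feats):
        # event_feats: (B, N_events, dim)
        q = self.topic_queries.unsqueeze(0).expand(event_feats.size(0), -1, -1)
        topics, _ = self.attn(q, event_feats, event_feats)
        return topics                                   # (B, K, dim)

class TopicOrientedRelation(nn.Module):
    """Each event feature attends to the latent topics, so the features
    passed to the caption decoder carry topic context."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, event_feats, topics):
        ctx, _ = self.attn(event_feats, topics, topics)
        return self.norm(event_feats + ctx)             # (B, N_events, dim)

if __name__ == "__main__":
    events = torch.randn(2, 12, 512)                    # 12 event segments
    topics = LatentTopicFinder()(events)
    print(TopicOrientedRelation()(events, topics).shape)
```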

{"title":"A latent topic-aware network for dense video captioning","authors":"Tao Xu,&nbsp;Yuanyuan Cui,&nbsp;Xinyu He,&nbsp;Caihua Liu","doi":"10.1049/cvi2.12195","DOIUrl":"10.1049/cvi2.12195","url":null,"abstract":"<p>Multiple events in a long untrimmed video possess the characteristics of similarity and continuity. These characteristics can be considered as a kind of topic semantic information, which probably behaves as same sports, similar scenes, same objects etc. Inspired by this, a novel latent topic-aware network (LTNet) is proposed in this article. The LTNet explores potential themes within videos and generates more continuous captions. Firstly, a global visual topic finder is employed to detect the similarity among events and obtain latent topic-level features. Secondly, a latent topic-oriented relation learner is designed to further enhance the topic-level representations by capturing the relationship between each event and the video themes. Benefiting from the finder and the learner, the caption generator is capable of predicting more accurate and coherent descriptions. The effectiveness of our proposed method is demonstrated on ActivityNet Captions and YouCook2 datasets, where LTNet shows a relative performance of over 3.03% and 0.50% in CIDEr score respectively.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 7","pages":"795-803"},"PeriodicalIF":1.7,"publicationDate":"2023-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12195","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49048324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Feature fusion over hyperbolic graph convolution networks for video summarisation
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-25 | DOI: 10.1049/cvi2.12232
GuangLi Wu, ShengTao Wang, ShiPeng Xu

A novel video summarisation method called the Hyperbolic Graph Convolutional Network (HVSN) is proposed, which addresses the challenges of summarising edited videos and capturing the semantic consistency of video shots at different time points. Unlike existing methods that use linear video sequences as input, HVSN leverages Hyperbolic Graph Convolutional Networks (HGCNs) and an adaptive graph convolutional adjacency matrix network to learn and aggregate features from video shots. Moreover, a feature fusion mechanism based on the attention mechanism is employed to facilitate cross-module feature fusion. To evaluate the performance of the proposed method, experiments are conducted on two benchmark datasets, TVSum and SumMe. The results demonstrate that HVSN achieves state-of-the-art performance, with F1-scores of 62.04% and 50.26% on TVSum and SumMe, respectively. The use of HGCNs enables the model to better capture the complex spatial structures of video shots, and thus contributes to the improved performance of video summarisation.
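As a rough illustration of the hyperbolic graph convolution at the core of HVSN, the sketch below implements a generic tangent-space variant: shot features on the Poincaré ball are mapped to the tangent space at the origin, transformed and aggregated with a learnable (adaptive) adjacency, and mapped back. The curvature handling, layer composition, and adjacency parameterisation are simplifications, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

def expmap0(v, c=1.0, eps=1e-6):
    # Exponential map at the origin of the Poincare ball (curvature -c).
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(x, c=1.0, eps=1e-6):
    # Logarithmic map at the origin: back to the tangent (Euclidean) space.
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp_min(eps)
    scaled = (sqrt_c * norm).clamp(max=1 - 1e-5)
    return torch.atanh(scaled) * x / (sqrt_c * norm)

class HyperbolicGraphConv(nn.Module):
    """One tangent-space hyperbolic graph convolution over shot features:
    log-map to the tangent space, linear transform plus aggregation with a
    learnable row-normalised adjacency, then exp-map back to the ball."""

    def __init__(self, in_dim, out_dim, num_shots, c=1.0):
        super().__init__()
        self.c = c
        self.linear = nn.Linear(in_dim, out_dim)
        self.adj = nn.Parameter(torch.eye(num_shots) + 0.01 * torch.randn(num_shots, num_shots))

    def forward(self, shots_hyp):
        # shots_hyp: (num_shots, in_dim), points on the Poincare ball
        tangent = logmap0(shots_hyp, self.c)
        a = torch.softmax(self.adj, dim=-1)
        tangent = a @ torch.relu(self.linear(tangent))
        return expmap0(tangent, self.c)

if __name__ == "__main__":
    shots = expmap0(0.1 * torch.randn(20, 512))   # 20 video shots on the ball
    layer = HyperbolicGraphConv(512, 256, num_shots=20)
    print(layer(shots).shape)                     # torch.Size([20, 256])
```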

{"title":"Feature fusion over hyperbolic graph convolution networks for video summarisation","authors":"GuangLi Wu,&nbsp;ShengTao Wang,&nbsp;ShiPeng Xu","doi":"10.1049/cvi2.12232","DOIUrl":"10.1049/cvi2.12232","url":null,"abstract":"<p>A novel video summarisation method called the Hyperbolic Graph Convolutional Network (HVSN) is proposed, which addresses the challenges of summarising edited videos and capturing the semantic consistency of video shots at different time points. Unlike existing methods that use linear video sequences as input, HVSN leverages Hyperbolic Graph Convolutional Networks (HGCNs) and an adaptive graph convolutional adjacency matrix network to learn and aggregate features from video shots. Moreover, a feature fusion mechanism based on the attention mechanism is employed to facilitate cross-module feature fusion. To evaluate the performance of the proposed method, experiments are conducted on two benchmark datasets, TVSum and SumMe. The results demonstrate that HVSN achieves state-of-the-art performance, with F1-scores of 62.04% and 50.26% on TVSum and SumMe, respectively. The use of HGCNs enables the model to better capture the complex spatial structures of video shots, and thus contributes to the improved performance of video summarisation.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"150-164"},"PeriodicalIF":1.7,"publicationDate":"2023-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12232","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46730355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Guest Editorial: Learning from limited annotations for computer vision tasks
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-16 | DOI: 10.1049/cvi2.12229
Yazhou Yao, Wenguan Wang, Qiang Wu, Dongfang Liu, Jin Zheng

The past decade has witnessed remarkable achievements in computer vision, owing to the fast development of deep learning. With the advancement of computing power and deep learning algorithms, we can process and apply millions or even hundreds of millions of large-scale data to train robust and advanced deep learning models. In spite of the impressive success, current deep learning methods tend to rely on massive annotated training data and lack the capability of learning from limited exemplars.

However, constructing a million-scale annotated dataset like ImageNet is time-consuming, labour-intensive and even infeasible in many applications. In certain fields, very limited annotated examples can be gathered due to various reasons such as privacy or ethical issues. Consequently, one of the pressing challenges in computer vision is to develop approaches that are capable of learning from limited annotated data. The purpose of this Special Issue is to collect high-quality articles on learning from limited annotations for computer vision tasks (e.g. image classification, object detection, semantic segmentation, instance segmentation and many others), publish new ideas, theories, solutions and insights on this topic and showcase their applications.

In this Special Issue we received 29 papers, all of which underwent peer review. Of the 29 originally submitted papers, 9 have been accepted.

The nine accepted papers can be clustered into two main categories: theoretical and applications. The papers that fall into the first category are by Liu et al., Li et al. and He et al. The second category of papers offers a direct solution to various computer vision tasks. These papers are by Ma et al., Wu et al., Rao et al., Sun et al., Hou et al. and Gong et al. A brief presentation of each of the papers in this Special Issue follows.

All of the papers selected for this Special Issue show that the field of learning from limited annotations for computer vision tasks is steadily moving forward. The possibility of a weakly supervised learning paradigm will remain a source of inspiration for new techniques in the years to come.

{"title":"Guest Editorial: Learning from limited annotations for computer vision tasks","authors":"Yazhou Yao,&nbsp;Wenguan Wang,&nbsp;Qiang Wu,&nbsp;Dongfang Liu,&nbsp;Jin Zheng","doi":"10.1049/cvi2.12229","DOIUrl":"https://doi.org/10.1049/cvi2.12229","url":null,"abstract":"<p>The past decade has witnessed remarkable achievements in computer vision, owing to the fast development of deep learning. With the advancement of computing power and deep learning algorithms, we can process and apply millions or even hundreds of millions of large-scale data to train robust and advanced deep learning models. In spite of the impressive success, current deep learning methods tend to rely on massive annotated training data and lack the capability of learning from limited exemplars.</p><p>However, constructing a million-scale annotated dataset like ImageNet is time-consuming, labour-intensive and even infeasible in many applications. In certain fields, very limited annotated examples can be gathered due to various reasons such as privacy or ethical issues. Consequently, one of the pressing challenges in computer vision is to develop approaches that are capable of learning from limited annotated data. The purpose of this Special Issue is to collect high-quality articles on learning from limited annotations for computer vision tasks (e.g. image classification, object detection, semantic segmentation, instance segmentation and many others), publish new ideas, theories, solutions and insights on this topic and showcase their applications.</p><p>In this Special Issue we received 29 papers, all of which underwent peer review. Of the 29 originally submitted papers, 9 have been accepted.</p><p>The nine accepted papers can be clustered into two main categories: theoretical and applications. The papers that fall into the first category are by Liu et al., Li et al. and He et al. The second category of papers offers a direct solution to various computer vision tasks. These papers are by Ma et al., Wu et al., Rao et al., Sun et al., Hou et al. and Gong et al. A brief presentation of each of the papers in this Special Issue follows.</p><p>All of the papers selected for this Special Issue show that the field of learning from limited annotations for computer vision tasks is steadily moving forward. The possibility of a weakly supervised learning paradigm will remain a source of inspiration for new techniques in the years to come.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 5","pages":"509-512"},"PeriodicalIF":1.7,"publicationDate":"2023-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12229","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50151226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Point completion by a Stack-Style Folding Network with multi-scaled graphical features
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-11 | DOI: 10.1049/cvi2.12196
Yunbo Rao, Ping Xu, Shaoning Zeng, Jianping Gou

Point cloud completion is in wide demand because current point cloud acquisition equipment often yields incomplete results, in which large numbers of points fail to represent a relatively complete shape. Existing point cloud completion algorithms, mostly encoder-decoder structures with a grid transform (also presented as a folding operation), can hardly obtain a persuasive representation of the input clouds because their bottleneck-shaped result cannot capture a precise relationship between the global and local structures. For this reason, this article proposes a novel point cloud completion model based on a Stack-Style Folding Network (SSFN). Firstly, to enhance deep latent feature extraction, SSFN strengthens the shape feature extractor by integrating both low-level point features and high-level graphical features. Next, a precise representation is obtained from a high-dimensional semantic space to improve the reconstruction ability. Finally, a refining module is designed to produce a more evenly distributed result. Experimental results show that SSFN produces the most promising results on multiple representative metrics with fewer parameters than current models.
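The folding operation the abstract refers to (deforming a 2D grid into a 3D surface conditioned on a global shape code, applied in stacked stages) can be sketched as follows. This is a generic FoldingNet-style decoder written in PyTorch under assumed dimensions; it is not the SSFN architecture itself.

```python
import torch
import torch.nn as nn

class FoldingStep(nn.Module):
    """One folding operation: a 2D grid (or the previous point set) is
    concatenated with a tiled global shape code and deformed into 3D."""

    def __init__(self, code_dim=512, in_pts_dim=2, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(code_dim + in_pts_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))

    def forward(self, code, pts):
        # code: (B, code_dim) global feature, pts: (B, N, in_pts_dim)
        tiled = code.unsqueeze(1).expand(-1, pts.size(1), -1)
        return self.mlp(torch.cat([tiled, pts], dim=-1))      # (B, N, 3)

class StackedFoldingDecoder(nn.Module):
    """Two folding steps stacked: grid -> coarse points -> refined points."""

    def __init__(self, code_dim=512, grid_size=45):
        super().__init__()
        self.fold1 = FoldingStep(code_dim, in_pts_dim=2)
        self.fold2 = FoldingStep(code_dim, in_pts_dim=3)
        u = torch.linspace(-1.0, 1.0, grid_size)
        grid = torch.stack(torch.meshgrid(u, u, indexing="ij"), dim=-1)
        self.register_buffer("grid", grid.reshape(-1, 2))

    def forward(self, code):
        grid = self.grid.unsqueeze(0).expand(code.size(0), -1, -1)
        coarse = self.fold1(code, grid)
        return self.fold2(code, coarse)                       # (B, grid_size**2, 3)

if __name__ == "__main__":
    decoder = StackedFoldingDecoder()
    print(decoder(torch.randn(2, 512)).shape)                 # torch.Size([2, 2025, 3])
```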

{"title":"Point completion by a Stack-Style Folding Network with multi-scaled graphical features","authors":"Yunbo Rao,&nbsp;Ping Xu,&nbsp;Shaoning Zeng,&nbsp;Jianping Gou","doi":"10.1049/cvi2.12196","DOIUrl":"https://doi.org/10.1049/cvi2.12196","url":null,"abstract":"<p>Point cloud completion is prevalent due to the insufficient results from current point cloud acquisition equipments, where a large number of point data failed to represent a relatively complete shape. Existing point cloud completion algorithms, mostly encoder-decoder structures with grids transform (also presented as folding operation), can hardly obtain a persuasive representation of input clouds due to the issue that their bottleneck-shape result cannot tell a precise relationship between the global and local structures. For this reason, this article proposes a novel point cloud completion model based on a Stack-Style Folding Network (SSFN). Firstly, to enhance the deep latent feature extraction, SSFN enhances the exploitation of shape feature extractor by integrating both low-level point feature and high-level graphical feature. Next, a precise presentation is obtained from a high dimensional semantic space to improve the reconstruction ability. Finally, a refining module is designed to make a more evenly distributed result. Experimental results shows that our SSFN produces the most promising results of multiple representative metrics with a smaller scale parameters than current models.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 5","pages":"576-585"},"PeriodicalIF":1.7,"publicationDate":"2023-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12196","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"50128438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Low-rank preserving embedding regression for robust image feature extraction
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-08 | DOI: 10.1049/cvi2.12228
Tao Zhang, Chen-Feng Long, Yang-Jun Deng, Wei-Ye Wang, Si-Qiao Tan, Heng-Chao Li

Although low-rank representation (LRR)-based subspace learning has been widely applied for feature extraction in computer vision, how to enhance the discriminability of the low-dimensional features extracted by LRR-based subspace learning methods is still a problem that needs further investigation. Therefore, this paper proposes a novel low-rank preserving embedding regression (LRPER) method by integrating LRR, linear regression, and projection learning into a unified framework. In LRPER, LRR can reveal the underlying structure information to strengthen the robustness of projection learning. The robust L2,1-norm metric is employed to measure the low-rank reconstruction error and the regression loss in order to model noise and occlusions. An embedding regression is proposed to make full use of the prior information to improve the discriminability of the learned projection. In addition, an alternating iteration algorithm is designed to optimise the proposed model, and the computational complexity of the optimisation algorithm is briefly analysed. The convergence of the optimisation algorithm is studied theoretically and numerically. Finally, extensive experiments on four types of image datasets demonstrate the effectiveness of LRPER, and the experimental results show that it performs better than several state-of-the-art feature extraction methods.
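The robustness argument rests on the L2,1-norm, which sums the Euclidean norms of a matrix's columns so that a few heavily corrupted samples are penalised linearly instead of quadratically. The snippet below computes it and comments a generic LRR-plus-regression objective of the kind LRPER unifies; the exact terms, constraints, and variable roles are assumptions, not the paper's formulation.

```python
import torch

def l21_norm(m, dim=0):
    """L2,1 norm: Euclidean norm along `dim` (here over each column),
    then summed, so a few heavily corrupted columns are penalised
    linearly rather than quadratically."""
    return m.norm(dim=dim).sum()

# Generic shape of an LRR + regression + projection objective
# (the paper's exact terms and constraints may differ):
#   min_{P, Z, E}  ||Z||_*  +  a * ||E||_{2,1}  +  b * ||Y - W^T P^T X||_{2,1}
#   s.t.           P^T X = P^T X Z + E
E = torch.zeros(50, 200)          # 50 features x 200 samples
E[:, :5] = 3.0                    # five occluded/corrupted samples
print(float(l21_norm(E)))         # only those five columns contribute
```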

{"title":"Low-rank preserving embedding regression for robust image feature extraction","authors":"Tao Zhang,&nbsp;Chen-Feng Long,&nbsp;Yang-Jun Deng,&nbsp;Wei-Ye Wang,&nbsp;Si-Qiao Tan,&nbsp;Heng-Chao Li","doi":"10.1049/cvi2.12228","DOIUrl":"10.1049/cvi2.12228","url":null,"abstract":"<p>Although low-rank representation (LRR)-based subspace learning has been widely applied for feature extraction in computer vision, how to enhance the discriminability of the low-dimensional features extracted by LRR based subspace learning methods is still a problem that needs to be further investigated. Therefore, this paper proposes a novel low-rank preserving embedding regression (LRPER) method by integrating LRR, linear regression, and projection learning into a unified framework. In LRPER, LRR can reveal the underlying structure information to strengthen the robustness of projection learning. The robust metric <i>L</i><sub>2,1</sub>-norm is employed to measure the low-rank reconstruction error and regression loss for moulding the noise and occlusions. An embedding regression is proposed to make full use of the prior information for improving the discriminability of the learned projection. In addition, an alternative iteration algorithm is designed to optimise the proposed model, and the computational complexity of the optimisation algorithm is briefly analysed. The convergence of the optimisation algorithm is theoretically and numerically studied. At last, extensive experiments on four types of image datasets are carried out to demonstrate the effectiveness of LRPER, and the experimental results demonstrate that LRPER performs better than some state-of-the-art feature extraction methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"124-140"},"PeriodicalIF":1.7,"publicationDate":"2023-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12228","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46800153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Visual privacy behaviour recognition for social robots based on an improved generative adversarial network
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-04 | DOI: 10.1049/cvi2.12231
Guanci Yang, Jiacheng Lin, Zhidong Su, Yang Li

Although social robots equipped with visual devices may leak user information, countermeasures for ensuring privacy are not readily available, making visual privacy protection problematic. In this article, a semi-supervised learning algorithm for visual privacy behaviour recognition, based on an improved generative adversarial network for social robots and called PBR-GAN, is proposed. A 9-layer residual generator network enhances the data quality, and a 10-layer discriminator network strengthens the feature extraction. A tailored objective function, loss function, and strategy for dynamically adjusting the learning rate are proposed to guarantee high performance. A social robot platform and architecture for visual privacy recognition and protection are implemented. The recognition accuracy of the proposed PBR-GAN is compared with that of Inception_v3, SS-GAN, and SF-GAN. The average recognition accuracy of PBR-GAN is 85.91%, an improvement of 3.93%, 9.91%, and 1.73% over Inception_v3, SS-GAN, and SF-GAN, respectively. Through a case study, seven situations related to privacy at home are considered, and training and test datasets with 8,720 and 1,280 images, respectively, are developed. The proposed PBR-GAN recognises the designed visual privacy information with an average accuracy of 89.91%.
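A common way to make a GAN discriminator double as a semi-supervised classifier (the scheme popularised by SS-GAN, which PBR-GAN is compared against) is to give it K behaviour classes plus one extra class for generated images. The sketch below shows that idea only; the layer sizes are placeholders and do not reproduce the paper's 9-layer generator or 10-layer discriminator.

```python
import torch
import torch.nn as nn

NUM_BEHAVIOURS = 7   # the case study considers seven privacy situations

class SemiSupervisedDiscriminator(nn.Module):
    """Discriminator that doubles as a classifier: K behaviour classes
    plus one extra class for generated (fake) images, so labelled,
    unlabelled, and generated images all contribute to training."""

    def __init__(self, num_classes=NUM_BEHAVIOURS):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(128, num_classes + 1)   # +1 for "fake"

    def forward(self, x):
        return self.head(self.features(x))

if __name__ == "__main__":
    d = SemiSupervisedDiscriminator()
    print(d(torch.randn(4, 3, 128, 128)).shape)       # torch.Size([4, 8])
```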

{"title":"Visual privacy behaviour recognition for social robots based on an improved generative adversarial network","authors":"Guanci Yang,&nbsp;Jiacheng Lin,&nbsp;Zhidong Su,&nbsp;Yang Li","doi":"10.1049/cvi2.12231","DOIUrl":"10.1049/cvi2.12231","url":null,"abstract":"<p>Although social robots equipped with visual devices may leak user information, countermeasures for ensuring privacy are not readily available, making visual privacy protection problematic. In this article, a semi-supervised learning algorithm is proposed for visual privacy behaviour recognition based on an improved generative adversarial network for social robots; it is called PBR-GAN. A 9-layer residual generator network enhances the data quality, and a 10-layer discriminator network strengthens the feature extraction. A tailored objective function, loss function, and strategy are proposed to dynamically adjust the learning rate to guarantee high performance. A social robot platform and architecture for visual privacy recognition and protection are implemented. The recognition accuracy of the proposed PBR-GAN is compared with Inception_v3, SS-GAN, and SF-GAN. The average recognition accuracy of the proposed PBR-GAN is 85.91%, which is improved by 3.93%, 9.91%, and 1.73% compared with the performance of Inception_v3, SS-GAN, and SF-GAN respectively. Through a case study, seven situations are considered related to privacy at home, and develop training and test datasets with 8,720 and 1,280 images, respectively, are developed. The proposed PBR-GAN recognises the designed visual privacy information with an average accuracy of 89.91%.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"110-123"},"PeriodicalIF":1.7,"publicationDate":"2023-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12231","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47731526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Determining the proper number of proposals for individual images
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-03 | DOI: 10.1049/cvi2.12230
Zihang He, Yong Li

The region proposal network is indispensable to two-stage object detection methods. It generates a fixed number of proposals that are to be classified and regressed by detection heads to produce detection boxes. However, the fixed number of proposals may be too large when an image contains only a few objects but too small when it contains many more. Considering this, the authors explored determining a proper number of proposals according to the number of objects in an image, to reduce the computational cost while improving the detection accuracy. Since the number of ground-truth objects is unknown at the inference stage, the authors designed a simple but effective module to predict the number of foreground regions, which is substituted for the number of objects when determining the proposal number. Experimental results of various two-stage detection methods on different datasets, including MS-COCO, PASCAL VOC, and CrowdHuman, showed that equipping the designed module increased the detection accuracy while decreasing the FLOPs of the detection head. For example, experimental results on the PASCAL VOC dataset showed that applying the designed module to Libra R-CNN and Grid R-CNN increased AP50 by over 1.5 points while decreasing the FLOPs of the detection heads from 28.6 G to nearly 9.0 G.
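The core mechanism can be sketched in a few lines of PyTorch: a small head regresses the number of foreground regions from a feature map, and that prediction sets how many top-scoring RPN proposals are kept. The head architecture, the cap on the count, and the proposals-per-object factor are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class ForegroundCountHead(nn.Module):
    """Small head that regresses the number of foreground regions from a
    feature map; the prediction sets how many top-scoring proposals the
    detector keeps for this image."""

    def __init__(self, in_channels=256, max_count=100):
        super().__init__()
        self.max_count = max_count
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, feat):
        # feat: (B, C, H, W) backbone/FPN feature map
        return self.net(feat) * self.max_count            # (B, 1) predicted counts

def select_proposals(boxes, scores, predicted_count, per_object=10):
    # Keep roughly `per_object` proposals per predicted foreground region.
    k = min(int(predicted_count.round().item()) * per_object, scores.numel())
    keep = scores.topk(max(k, 1)).indices
    return boxes[keep]

if __name__ == "__main__":
    head = ForegroundCountHead()
    count = head(torch.randn(1, 256, 50, 50))[0]
    boxes, scores = torch.rand(1000, 4), torch.rand(1000)
    print(count.item(), select_proposals(boxes, scores, count).shape)
```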

{"title":"Determining the proper number of proposals for individual images","authors":"Zihang He,&nbsp;Yong Li","doi":"10.1049/cvi2.12230","DOIUrl":"10.1049/cvi2.12230","url":null,"abstract":"<p>The region proposal network is indispensable to two-stage object detection methods. It generates a fixed number of proposals that are to be classified and regressed by detection heads to produce detection boxes. However, the fixed number of proposals may be too large when an image contains only a few objects but too small when it contains much more objects. Considering this, the authors explored determining a proper number of proposals according to the number of objects in an image to reduce the computational cost while improving the detection accuracy. Since the number of ground truth objects is unknown at the inference stage, the authors designed a simple but effective module to predict the number of foreground regions, which will be substituted for the number of objects for determining the proposal number. Experimental results of various two-stage detection methods on different datasets, including MS-COCO, PASCAL VOC, and CrowdHuman showed that equipping the designed module increased the detection accuracy while decreasing the FLOPs of the detection head. For example, experimental results on the PASCAL VOC dataset showed that applying the designed module to Libra R-CNN and Grid R-CNN increased over 1.5 AP<sub>50</sub> while decreasing the FLOPs of detection heads from 28.6 G to nearly 9.0 G.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 1","pages":"141-149"},"PeriodicalIF":1.7,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12230","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43408162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Zero-shot temporal event localisation: Label-free, training-free, domain-free
IF 1.7 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2023-08-03 | DOI: 10.1049/cvi2.12224
Li Sun, Ping Wang, Liuan Wang, Jun Sun, Takayuki Okatani
Temporal event localisation (TEL) has recently attracted increasing attention due to the rapid development of video platforms. Existing methods are based on either fully/weakly supervised or unsupervised learning, and thus they rely on expensive data annotation and time-consuming training. Moreover, these models, trained on specific domain data, generalise poorly under data distribution shifts. To cope with these difficulties, the authors propose a zero-shot TEL method that can operate without training data or annotations. Leveraging large-scale vision-and-language pre-trained models, for example CLIP, two key problems are solved: (1) how to find the relevant region where the event is likely to occur; (2) how to determine the event duration once the relevant region is found. Query-guided optimisation of local frame relevance, relying on the query-to-frame relationship, is proposed to find the frame region where the event is most likely to occur. A proposal generation method relying on the frame-to-frame relationship is proposed to determine the event duration. The authors also propose a greedy event sampling strategy to predict multiple durations with high reliability for a given event. The methodology is unique in offering a label-free, training-free, and domain-free approach, enabling TEL to be applied purely at the testing stage. Practical results show that it achieves competitive performance on the standard Charades-STA and ActivityCaptions datasets.
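The query-to-frame step can be approximated with off-the-shelf CLIP: encode the text query and sampled frames, rank frames by cosine similarity, and grow an interval around the peak. The sketch below does exactly that as a crude stand-in; the frame paths are hypothetical placeholders, and the interval-growing rule is a simplification of the authors' query-guided optimisation, proposal generation, and greedy sampling.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

# Hypothetical inputs: sampled frame images and a natural-language query.
frame_paths = [f"frames/{i:04d}.jpg" for i in range(0, 300, 10)]
query = "a person opens the refrigerator"

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths]).to(device)
    f_img = model.encode_image(frames)
    f_txt = model.encode_text(clip.tokenize([query]).to(device))
    f_img = f_img / f_img.norm(dim=-1, keepdim=True)
    f_txt = f_txt / f_txt.norm(dim=-1, keepdim=True)
    relevance = (f_img @ f_txt.T).squeeze(1)         # query-to-frame scores

# Crude interval growing around the best-matching frame: extend the span
# while relevance stays above a fraction of the peak score.
peak = int(relevance.argmax())
thresh = 0.9 * relevance[peak]
start = end = peak
while start > 0 and relevance[start - 1] >= thresh:
    start -= 1
while end < len(frame_paths) - 1 and relevance[end + 1] >= thresh:
    end += 1
print(f"predicted event span: {frame_paths[start]} .. {frame_paths[end]}")
```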
{"title":"Zero-shot temporal event localisation: Label-free, training-free, domain-free","authors":"Li Sun, Ping Wang, Liuan Wang, Jun Sun, Takayuki Okatani","doi":"10.1049/cvi2.12224","DOIUrl":"https://doi.org/10.1049/cvi2.12224","url":null,"abstract":"Temporal event localisation (TEL) has recently attracted increasing attention due to the rapid development of video platforms. Existing methods are based on either fully/weakly supervised or unsupervised learning, and thus they rely on expensive data annotation and time‐consuming training. Moreover, these models, which are trained on specific domain data, limit the model generalisation to data distribution shifts. To cope with these difficulties, the authors propose a zero‐shot TEL method that can operate without training data or annotations. Leveraging large‐scale vision and language pre‐trained models, for example, CLIP, we solve the two key problems: (1) how to find the relevant region where the event is likely to occur; (2) how to determine event duration after we find the relevant region. Query guided optimisation for local frame relevance relying on the query‐to‐frame relationship is proposed to find the most relevant frame region where the event is most likely to occur. Proposal generation method relying on the frame‐to‐frame relationship is proposed to determine the event duration. The authors also propose a greedy event sampling strategy to predict multiple durations with high reliability for the given event. The authors’ methodology is unique, offering a label‐free, training‐free, and domain‐free approach. It enables the application of TEL purely at the testing stage. The practical results show it achieves competitive performance on the standard Charades‐STA and ActivityCaptions datasets.","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"17 1","pages":"599-613"},"PeriodicalIF":1.7,"publicationDate":"2023-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"57700647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0