Human action understanding (HAU) is a broad topic that involves specific tasks, such as action localisation, recognition, and assessment. However, most popular HAU datasets are bound to one task based on particular actions. Combining different but relevant HAU tasks to establish a unified action understanding system is challenging due to the disparate actions across datasets. A large-scale and comprehensive benchmark, namely SkatingVerse, is constructed for action recognition, segmentation, proposal, and assessment. SkatingVerse focuses on fine-grained sport actions; figure skating is chosen as the task object, which eliminates the biases of object, scene, and space that exist in most previous datasets. In addition, skating actions have inherent complexity and similarity, which poses an enormous challenge for current algorithms. A total of 1687 official figure skating competition videos were collected, amounting to 184.4 h, more than four times the size of other datasets on similar topics. SkatingVerse enables the formulation of a unified task that outputs fine-grained human action classification and assessment results from a raw figure skating competition video. In addition, SkatingVerse can facilitate the study of HAU foundation models due to its large scale and abundant categories. Moreover, the image modality is incorporated into SkatingVerse for the human pose estimation task. Extensive experimental results show that (1) SkatingVerse significantly helps the training and evaluation of HAU methods, (2) the performance of existing HAU methods has much room to improve, and SkatingVerse helps to reduce such gaps, and (3) unifying relevant tasks in HAU through a uniform dataset can facilitate more practical applications. SkatingVerse will be publicly available to facilitate further studies on relevant problems.
{"title":"SkatingVerse: A large-scale benchmark for comprehensive evaluation on human action understanding","authors":"Ziliang Gan, Lei Jin, Yi Cheng, Yu Cheng, Yinglei Teng, Zun Li, Yawen Li, Wenhan Yang, Zheng Zhu, Junliang Xing, Jian Zhao","doi":"10.1049/cvi2.12287","DOIUrl":"https://doi.org/10.1049/cvi2.12287","url":null,"abstract":"<p>Human action understanding (HAU) is a broad topic that involves specific tasks, such as action localisation, recognition, and assessment. However, most popular HAU datasets are bound to one task based on particular actions. Combining different but relevant HAU tasks to establish a unified action understanding system is challenging due to the disparate actions across datasets. A large-scale and comprehensive benchmark, namely <b>SkatingVerse</b> is constructed for action recognition, segmentation, proposal, and assessment. SkatingVerse focus on fine-grained sport action, hence figure skating is chosen as the task object, which eliminates the biases of the object, scene, and space that exist in most previous datasets. In addition, skating actions have inherent complexity and similarity, which is an enormous challenge for current algorithms. A total of 1687 official figure skating competition videos was collected with a total of 184.4 h, exceeding four times over other datasets with a similar topic. SkatingVerse enables to formulate a unified task to output fine-grained human action classification and assessment results from a raw figure skating competition video. In addition, <i>SkatingVerse</i> can facilitate the study of HAU foundation model due to its large scale and abundant categories. Moreover, image modality is incorporated for human pose estimation task into <i>SkatingVerse</i>. Extensive experimental results show that (1) SkatingVerse significantly helps the training and evaluation of HAU methods, (2) the performance of existing HAU methods has much room to improve, and SkatingVerse helps to reduce such gaps, and (3) unifying relevant tasks in HAU through a uniform dataset can facilitate more practical applications. SkatingVerse will be publicly available to facilitate further studies on relevant problems.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"888-906"},"PeriodicalIF":1.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12287","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hengyu Mu, Jian Guo, Xingli Liu, Chong Han, Lijuan Sun
Recently, the application of finger vein recognition has become popular. Studies have shown that finger vein presentation attacks increasingly threaten these recognition devices. As a result, research on finger vein presentation attack detection (fvPAD) methods has received much attention. However, current fvPAD methods have two limitations. (1) Most terminal devices cannot train fvPAD models independently due to a lack of data. (2) Several research institutes can train fvPAD models; however, these models perform poorly when applied to terminal devices due to inadequate generalisation. Consequently, it is difficult for threatened terminal devices to obtain an effective fvPAD model. To address this problem, a federated finger vein presentation attack detection method for various clients is proposed, which is the first study to introduce federated learning (FL) to fvPAD. In the proposed method, the differences in data volume and computing power between clients are considered. Traditional FL clients are expanded into two categories: institutional and terminal clients. For institutional clients, an improved triplet training mode with FL is designed to enhance model generalisation. For terminal clients, their inability to obtain effective fvPAD models is resolved. Finally, extensive experiments are conducted on three datasets, which demonstrate the superiority of the method.
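As a rough illustration of the two-tier federated setup the abstract describes, the sketch below assumes a round in which only data-rich institutional clients perform local updates (here with a triplet term standing in for the improved triplet training mode), while data-poor terminal clients simply receive the aggregated model. It is not the authors' implementation; `embed`, `classify`, the loss weights, and all hyper-parameters are illustrative assumptions.

```python
# Minimal sketch of federated training with institutional and terminal clients.
import copy
import torch
import torch.nn as nn

def local_institutional_update(global_model, loader, epochs=1, lr=1e-3, margin=0.3):
    """One round of local training on an institutional client.
    `loader` is assumed to yield (anchor, positive, negative, label) batches;
    `embed` and `classify` are assumed interfaces of the fvPAD model."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    triplet = nn.TripletMarginLoss(margin=margin)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for anchor, pos, neg, label in loader:
            f_a, f_p, f_n = model.embed(anchor), model.embed(pos), model.embed(neg)
            loss = triplet(f_a, f_p, f_n) + ce(model.classify(f_a), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model.state_dict()

def federated_average(state_dicts, weights):
    """Weighted FedAvg over the institutional clients' parameters."""
    total = sum(weights)
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = sum(w * sd[key] for sd, w in zip(state_dicts, weights)) / total
    return avg

def run_round(global_model, institutional_loaders, terminal_models):
    states = [local_institutional_update(global_model, ld) for ld in institutional_loaders]
    weights = [len(ld.dataset) for ld in institutional_loaders]
    global_model.load_state_dict(federated_average(states, weights))
    # Terminal clients cannot train locally; they simply adopt the aggregated model.
    for tm in terminal_models:
        tm.load_state_dict(global_model.state_dict())
    return global_model
```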
{"title":"Federated finger vein presentation attack detection for various clients","authors":"Hengyu Mu, Jian Guo, Xingli Liu, Chong Han, Lijuan Sun","doi":"10.1049/cvi2.12292","DOIUrl":"https://doi.org/10.1049/cvi2.12292","url":null,"abstract":"<p>Recently, the application of finger vein recognition has become popular. Studies have shown finger vein presentation attacks increasingly threaten these recognition devices. As a result, research on finger vein presentation attack detection (fvPAD) methods has received much attention. However, the current fvPAD methods have two limitations. (1) Most terminal devices cannot train fvPAD models independently due to a lack of data. (2) Several research institutes can train fvPAD models; however, these models perform poorly when applied to terminal devices due to inadequate generalisation. Consequently, it is difficult for threatened terminal devices to obtain an effective fvPAD model. To address this problem, the method of federated finger vein presentation attack detection for various clients is proposed, which is the first study that introduces federated learning (FL) to fvPAD. In the proposed method, the differences in data volume and computing power between clients are considered. Traditional FL clients are expanded into two categories: institutional and terminal clients. For institutional clients, an improved triplet training mode with FL is designed to enhance model generalisation. For terminal clients, their inability is solved to obtain effective fvPAD models. Finally, extensive experiments are conducted on three datasets, which demonstrate the superiority of our method.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"935-949"},"PeriodicalIF":1.5,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12292","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142563029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaohu Huang, Minghui Jia, Xianghua Tai, Wei Wang, Qi Hu, Dongping Liu, Peiheng Guo, Shengxiang Tian, Dequan Yan, Haishan Han
Insulator defect detection is crucial for the stable operation of power systems. Realising insulator defect detection by combining line images captured by UAVs with deep learning techniques has become a mainstream research direction. However, existing high-quality insulator defect detection models still face problems such as reliance on massive labelled data and huge model parameter counts. Especially on resource-constrained devices, it becomes a challenge to strike a balance between model lightweighting and performance. Although the knowledge distillation technique provides a solution for model lightweighting, the loss of information in the distillation process leads to performance degradation of small models, which in turn creates a trade-off between lightweighting and performance. Hence, an insulator defect detection method based on federated knowledge distillation is proposed. The method not only realises the lightweighting of the model but also effectively improves model performance by collaboratively training the model through the federated learning approach. Moreover, the asynchronous aggregation approach and model freshness mechanism designed in the method further enhance training efficiency and the collaborative effect. The experimental results show that the detection accuracy and efficiency of the proposed method on public datasets are significantly better than those of the benchmark algorithm.
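A minimal, assumption-laden sketch of the two generic ingredients named here, not the paper's code: local knowledge distillation from a large teacher detector into a lightweight student, and an asynchronous server update whose mixing weight decays with the staleness ("freshness") of the client's starting round. Temperature, decay rate, and the mixing rule are illustrative choices.

```python
# Federated knowledge distillation with freshness-weighted asynchronous aggregation.
import copy
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    """Soft-label KD term plus the ordinary supervised term (per-class logits assumed)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

def freshness(server_round, client_round, decay=0.5):
    """Model freshness: stale client updates contribute less to the global model."""
    staleness = max(server_round - client_round, 0)
    return (1.0 + staleness) ** (-decay)

def async_aggregate(global_state, client_state, server_round, client_round, base_lr=0.5):
    """Asynchronously fold one client's student weights into the global model."""
    mix = base_lr * freshness(server_round, client_round)
    new_state = copy.deepcopy(global_state)
    for key in new_state:
        new_state[key] = (1 - mix) * global_state[key] + mix * client_state[key]
    return new_state
```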
{"title":"Federated knowledge distillation for enhanced insulator defect detection in resource-constrained environments","authors":"Xiaohu Huang, Minghui Jia, Xianghua Tai, Wei Wang, Qi Hu, Dongping Liu, Peiheng Guo, Shengxiang Tian, Dequan Yan, Haishan Han","doi":"10.1049/cvi2.12290","DOIUrl":"https://doi.org/10.1049/cvi2.12290","url":null,"abstract":"<p>Insulator defect detection is crucial for the stable operation of power systems. It has become a mainstream research direction to realise insulator defect detection based on the combination of line images captured by UAVs and deep learning techniques. However, the existing high-quality insulator defect detection models still face problems such as relying on massive-labelled data and huge model parameters. Especially on resource-constrained devices, it becomes a challenge to strike a balance between model lightweighting and performance. Although the knowledge distillation technique provides a solution for model lightweighting, the loss of information in the distillation process leads to the performance degradation of small models, which in turn creates a paradox between lightweighting and performance. Hence, an insulator defect detection method based on federated knowledge distillation is proposed. The method not only realises the lightweighting of the model, but also effectively improves the model performance by collaboratively training the model through the federated learning approach. Moreover, the asynchronous aggregation approach and model freshness mechanism designed in the method further enhance the training efficiency and collaborative effect. The experimental results show that the detection accuracy and efficiency of this paper's method on public datasets are significantly better than the benchmark algorithm.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1072-1086"},"PeriodicalIF":1.5,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12290","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming Xie, Hengliang Tan, Jiao Du, Shuo Yang, Guofeng Yan, Wangwang Li, Jianwei Feng
Linear discriminant analysis is a classical method for solving problems of dimensionality reduction and pattern classification. Although it has been extensively developed, it still suffers from several common problems, such as the Small Sample Size (SSS) problem and the multimodal problem. Neighbourhood linear discriminant analysis (nLDA) was recently proposed to solve the multimodal class problem caused by the violation of the assumption that samples are independently and identically distributed. However, because many practical applications are small-scale, nLDA still has to face the SSS problem, which leads to instability and poor generalisation caused by the singularity of the within-neighbourhood scatter matrix. The authors exploit eigenspectrum regularisation techniques to circumvent the singularity of the within-neighbourhood scatter matrix of nLDA, yielding a method called Eigenspectrum Regularisation Reverse Neighbourhood Discriminative Learning (ERRNDL). The nLDA algorithm is reformulated as a framework that searches for two projection matrices. Three eigenspectrum regularisation models are introduced into the framework to evaluate the performance. Experiments are conducted on the University of California, Irvine machine learning repository and six image classification datasets. The proposed ERRNDL-based methods achieve competitive performance.
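A minimal NumPy sketch, under assumptions, of the generic eigenspectrum regularisation idea used to avoid a singular within-neighbourhood scatter matrix: decompose the scatter matrix, replace the unreliable small and zero eigenvalues with values extrapolated from the reliable part of the spectrum, and rebuild a full-rank matrix. The 1/(k + beta) decay model and the energy-based reliability cut-off are illustrative choices, not the paper's three exact models.

```python
# Generic eigenspectrum regularisation of a (possibly singular) scatter matrix.
import numpy as np

def regularise_eigenspectrum(scatter, energy_keep=0.95):
    evals, evecs = np.linalg.eigh(scatter)          # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]      # sort descending
    evals = np.clip(evals, 0.0, None)

    # Reliable face of the spectrum: the leading components carrying most of the energy.
    cum = np.cumsum(evals) / max(evals.sum(), 1e-12)
    m = int(np.searchsorted(cum, energy_keep)) + 1

    # Fit lambda_k ~ alpha / (k + beta) to the 1st and m-th reliable eigenvalues.
    alpha = evals[0] * evals[m - 1] * (m - 1) / max(evals[0] - evals[m - 1], 1e-12)
    beta = alpha / max(evals[0], 1e-12) - 1.0

    # Replace the unreliable tail (including null-space eigenvalues) with the model.
    k_all = np.arange(1, len(evals) + 1)
    model = alpha / (k_all + beta)
    new_evals = np.concatenate([evals[:m], model[m:]])

    return (evecs * new_evals) @ evecs.T            # regularised, full-rank scatter matrix
```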
{"title":"Eigenspectrum regularisation reverse neighbourhood discriminative learning","authors":"Ming Xie, Hengliang Tan, Jiao Du, Shuo Yang, Guofeng Yan, Wangwang Li, Jianwei Feng","doi":"10.1049/cvi2.12284","DOIUrl":"10.1049/cvi2.12284","url":null,"abstract":"<p>Linear discriminant analysis is a classical method for solving problems of dimensional reduction and pattern classification. Although it has been extensively developed, however, it still suffers from various common problems, such as the Small Sample Size (SSS) and the multimodal problem. Neighbourhood linear discriminant analysis (nLDA) was recently proposed to solve the problem of multimodal class caused by the contravention of independently and identically distributed samples. However, due to the existence of many small-scale practical applications, nLDA still has to face the SSS problem, which leads to instability and poor generalisation caused by the singularity of the within-neighbourhood scatter matrix. The authors exploit the eigenspectrum regularisation techniques to circumvent the singularity of the within-neighbourhood scatter matrix of nLDA, which is called Eigenspectrum Regularisation Reverse Neighbourhood Discriminative Learning (ERRNDL). The algorithm of nLDA is reformulated as a framework by searching two projection matrices. Three eigenspectrum regularisation models are introduced to our framework to evaluate the performance. Experiments are conducted on the University of California, Irvine machine learning repository and six image classification datasets. The proposed ERRNDL-based methods achieve considerable performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"842-858"},"PeriodicalIF":1.5,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12284","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140980457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite the impressive progress in visual localisation, 6-DoF cross-view localisation is still a challenging task in the computer vision community due to huge appearance changes. To address this issue, the authors propose CLaSP, a coarse-to-fine framework that leverages a synthetic panorama to facilitate cross-view 6-DoF localisation in a large-scale scene. The authors first leverage a segmentation map to correct the prior pose, followed by a synthetic panorama on the ground to enable coarse pose estimation combined with a template matching method. The authors finally formulate the localisation refinement process as feature matching and pose refinement to obtain the final result. The authors evaluate the performance of CLaSP and several state-of-the-art baselines on the Airloc dataset, which demonstrates the effectiveness of the proposed framework.
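A rough illustrative sketch, not the authors' implementation, of how a template-matching coarse stage against a synthetic ground panorama could work: the query view is slid over a 360-degree panorama rendered around the prior position, and the column of the best match gives a coarse heading that would seed the later feature-matching refinement. The panorama is assumed to be given, and reducing the coarse pose to a single yaw angle is a simplification made only for this example.

```python
# Coarse heading estimation by template matching a query view against a 360-degree panorama.
import cv2
import numpy as np

def coarse_yaw_from_panorama(query_bgr, panorama_bgr):
    """Estimate the camera heading by sliding the query over a synthetic panorama strip."""
    query = cv2.cvtColor(query_bgr, cv2.COLOR_BGR2GRAY)
    pano = cv2.cvtColor(panorama_bgr, cv2.COLOR_BGR2GRAY)
    # Resize the query so its height matches the panorama strip before matching.
    scale = pano.shape[0] / query.shape[0]
    query = cv2.resize(query, None, fx=scale, fy=scale)
    scores = cv2.matchTemplate(pano, query, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, best_loc = cv2.minMaxLoc(scores)
    # Column of the best match (plus half the template width) maps linearly to yaw.
    centre_col = best_loc[0] + query.shape[1] / 2.0
    yaw_deg = 360.0 * centre_col / pano.shape[1]
    return yaw_deg, best_score

# Usage with synthetic data: a random panorama and a crop taken from a known column.
pano = (np.random.rand(256, 2048, 3) * 255).astype(np.uint8)
crop = pano[:, 500:756].copy()
yaw, score = coarse_yaw_from_panorama(crop, pano)
print(round(yaw, 1), round(score, 2))   # yaw near 360 * 628 / 2048 (about 110.4 degrees), score near 1.0
```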
{"title":"CLaSP: Cross-view 6-DoF localisation assisted by synthetic panorama","authors":"Juelin Zhu, Shen Yan, Xiaoya Cheng, Rouwan Wu, Yuxiang Liu, Maojun Zhang","doi":"10.1049/cvi2.12285","DOIUrl":"10.1049/cvi2.12285","url":null,"abstract":"<p>Despite the impressive progress in visual localisation, 6-DoF cross-view localisation is still a challenging task in the computer vision community due to the huge appearance changes. To address this issue, the authors propose the CLaSP, a coarse-to-fine framework, which leverages a synthetic panorama to facilitate cross-view 6-DoF localisation in a large-scale scene. The authors first leverage a segmentation map to correct the prior pose, followed by a synthetic panorama on the ground to enable coarse pose estimation combined with a template matching method. The authors finally formulate the refine localisation process as feature matching and pose refinement to obtain the final result. The authors evaluate the performance of the CLaSP and several state-of-the-art baselines on the <i>Airloc</i> dataset, which demonstrates the effectiveness of our proposed framework.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 7","pages":"859-874"},"PeriodicalIF":1.5,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12285","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140986129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo
Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.

In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.

The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the papers in this special issue is as follows.

Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world dataset
{"title":"Guest Editorial: Advanced image restoration and enhancement in the wild","authors":"Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo","doi":"10.1049/cvi2.12283","DOIUrl":"https://doi.org/10.1049/cvi2.12283","url":null,"abstract":"<p>Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.</p><p>In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.</p><p>The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.</p><p>Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. 
Experiments on both synthetic and real-world dataset","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"435-438"},"PeriodicalIF":1.7,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141246088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Siyue Lei, Bin Tang, Yanhua Chen, Mingfu Zhao, Yifei Xu, Zourong Long
Skeleton-based action recognition has received much attention and achieved remarkable results in the field of human action recognition. For time-series action prediction at different scales, existing methods mainly focus on attention mechanisms to enhance modelling capabilities in the spatial dimension. However, this approach strongly depends on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors design a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales and avoid ignoring cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn topological graph features for time series of different lengths. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU-RGB+D 60 and 120, as well as UAV-Human, demonstrate that TRMGCN exhibits advanced performance and capabilities. Furthermore, experiments on the smaller dataset NW-UCLA indicate that the authors' model possesses strong generalisation ability.
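A minimal PyTorch sketch, under assumptions, of two generic ingredients named in the abstract: a multi-graph "independent" convolution, in which each of several skeleton adjacency matrices gets its own 1x1 convolution before the branches are merged, and a simple channel gate standing in for top-down channel modulation. The tensor layout, module names, and example graphs are illustrative, not the paper's exact TD-MIG design.

```python
# Multi-graph independent convolution over skeleton features with a channel gate.
import torch
import torch.nn as nn

class MultiGraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, adjacency_list):
        super().__init__()
        # One adjacency matrix (V x V) and one independent 1x1 convolution per graph branch.
        self.register_buffer("A", torch.stack(adjacency_list))                    # (G, V, V)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1) for _ in adjacency_list])
        self.gate = nn.Sequential(                                                # channel gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):                      # x: (N, C, T, V) skeleton features
        out = 0
        for conv, A in zip(self.branches, self.A):
            # Aggregate joint neighbours with this branch's graph, then transform channels.
            y = torch.einsum("nctv,vw->nctw", x, A)
            out = out + conv(y)
        return out * self.gate(out)            # channel-wise modulation of the merged features

# Usage: 25-joint skeleton, three graphs (identity, a random stand-in, and a uniform global graph).
V = 25
graphs = [torch.eye(V), torch.rand(V, V), torch.ones(V, V) / V]
layer = MultiGraphConv(64, 128, graphs)
feats = torch.randn(8, 64, 300, V)             # batch of 8 sequences, 300 frames
print(layer(feats).shape)                      # torch.Size([8, 128, 300, 25])
```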
{"title":"Temporal channel reconfiguration multi-graph convolution network for skeleton-based action recognition","authors":"Siyue Lei, Bin Tang, Yanhua Chen, Mingfu Zhao, Yifei Xu, Zourong Long","doi":"10.1049/cvi2.12279","DOIUrl":"10.1049/cvi2.12279","url":null,"abstract":"<p>Skeleton-based action recognition has received much attention and achieved remarkable achievements in the field of human action recognition. In time series action prediction for different scales, existing methods mainly focus on attention mechanisms to enhance modelling capabilities in spatial dimensions. However, this approach strongly depends on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors designed a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales and avoid ignoring cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn the topological graph feature for different length time series. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU-RGB + D60 and 120, as well as UAV-Human, demonstrate that TRMGCN exhibits advanced performance and capabilities. Furthermore, experiments on the smaller dataset NW-UCLA have indicated that the authors’ model possesses strong generalisation abilities.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"813-825"},"PeriodicalIF":1.5,"publicationDate":"2024-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12279","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140693975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
It remains challenging for instance segmentation to correctly distinguish different instances among overlapping, dense, and numerous target objects. To address this, the authors simplify the instance segmentation problem to an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm, CotuNet. Firstly, the algorithm combines convolutional neural networks (CNN), Outlooker, and Transformer to design a new hybrid encoder (COT) for feature extraction. It first extracts low-level features of the image using the CNN, which are passed through the Outlooker to extract more refined local data representations. Then, global contextual information is generated by aggregating the data representations in local space using the Transformer. Finally, the combination of cascaded upsampling and skip connection modules is used as the decoder (C-UP) to enable the blending of multiple different scales of high-resolution information and generate accurate masks. By validating on the CVPPP 2017 dataset and comparing with previous state-of-the-art methods, CotuNet shows superior competitiveness and segmentation performance.
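A minimal PyTorch skeleton, under assumptions, of the encoder-decoder shape the abstract describes: a CNN stem for low-level features, a local-refinement stage (a depthwise convolution standing in for the Outlooker, which is not re-implemented here), a Transformer stage for global context, and a decoder that cascades upsampling with skip connections. Channel sizes, depths, and module names are illustrative, not the COT/C-UP design itself.

```python
# CNN -> local refinement -> Transformer encoder, with cascaded upsampling + skip connections.
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class HybridEncoderDecoder(nn.Module):
    def __init__(self, c=64, num_classes=2):
        super().__init__()
        self.stem = conv_block(3, c)                                    # CNN: low-level features
        self.down1 = nn.Sequential(nn.MaxPool2d(2), conv_block(c, 2 * c))
        # Stand-in for the Outlooker stage: depthwise + pointwise conv mixing a local window.
        self.local = nn.Sequential(nn.Conv2d(2 * c, 2 * c, 3, padding=1, groups=2 * c),
                                   nn.Conv2d(2 * c, 2 * c, 1), nn.ReLU(inplace=True))
        self.down2 = nn.Sequential(nn.MaxPool2d(2), conv_block(2 * c, 4 * c))
        self.transformer = nn.TransformerEncoder(                       # global context
            nn.TransformerEncoderLayer(d_model=4 * c, nhead=8, batch_first=True), num_layers=2)
        self.up1 = nn.ConvTranspose2d(4 * c, 2 * c, 2, stride=2)        # cascaded upsampling
        self.dec1 = conv_block(4 * c, 2 * c)                            # after skip concatenation
        self.up2 = nn.ConvTranspose2d(2 * c, c, 2, stride=2)
        self.dec2 = conv_block(2 * c, c)
        self.head = nn.Conv2d(c, num_classes, 1)                        # per-pixel mask logits

    def forward(self, x):
        s1 = self.stem(x)                      # (N, c,  H,   W)
        s2 = self.local(self.down1(s1))        # (N, 2c, H/2, W/2)
        b = self.down2(s2)                     # (N, 4c, H/4, W/4)
        n, ch, h, w = b.shape
        b = self.transformer(b.flatten(2).transpose(1, 2))              # tokens: (N, H*W/16, 4c)
        b = b.transpose(1, 2).reshape(n, ch, h, w)
        d1 = self.dec1(torch.cat([self.up1(b), s2], dim=1))             # skip connection 1
        d2 = self.dec2(torch.cat([self.up2(d1), s1], dim=1))            # skip connection 2
        return self.head(d2)

print(HybridEncoderDecoder()(torch.randn(1, 3, 64, 64)).shape)          # torch.Size([1, 2, 64, 64])
```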
{"title":"Instance segmentation by blend U-Net and VOLO network","authors":"Hongfei Deng, Bin Wen, Rui Wang, Zuwei Feng","doi":"10.1049/cvi2.12275","DOIUrl":"10.1049/cvi2.12275","url":null,"abstract":"<p>Instance segmentation is still challengeable to correctly distinguish different instances on overlapping, dense and large number of target objects. To address this, the authors simplify the instance segmentation problem to an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm CotuNet. Firstly, the algorithm combines convolutional neural networks (CNN), Outlooker and Transformer to design a new hybrid Encoder (COT) to further feature extraction. It consists of extracting low-level features of the image using CNN, which is passed through the Outlooker to extract more refined local data representations. Then global contextual information is generated by aggregating the data representations in local space using Transformer. Finally, the combination of cascaded upsampling and skip connection modules is used as Decoders (C-UP) to enable the blend of multiple different scales of high-resolution information to generate accurate masks. By validating on the CVPPP 2017 dataset and comparing with previous state-of-the-art methods, CotuNet shows superior competitiveness and segmentation performance.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"735-744"},"PeriodicalIF":1.5,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12275","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140726439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hongjian Gu, Wenxuan Zou, Keyang Cheng, Bin Wu, Humaira Abdul Ghafoor, Yongzhao Zhan
Person re-identification aims to search for specific target pedestrians across non-intersecting cameras. However, in real complex scenes, pedestrians are easily obscured, which makes the target pedestrian search task time-consuming and challenging. To address the problem of pedestrians' susceptibility to occlusion, a person re-identification method via a deep compound eye network (CEN) and a pose repair module is proposed, which includes (1) a deep CEN based on multi-camera logical topology, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out global pedestrian matching through a Siamese network; (2) an integrated spatial-temporal information aggregation network designed to facilitate pose repair, in which the target pedestrian features under the multi-level logical topology cameras are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; and (3) a joint optimisation mechanism of the CEN and the pose repair network, where multi-camera logical topology inference provides auxiliary information and the retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the authors' method achieved significant performance across these datasets. Specifically, on the CUHK-SYSU dataset, the authors' model achieved a top-1 accuracy of 89.1% and a mean Average Precision of 83.1% in the recognition of occluded individuals.
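A minimal PyTorch sketch, under assumptions, of the generic idea behind component (1): per-camera pedestrian features are mixed over a camera logical-topology graph with a graph convolution, aggregated in walking order by a GRU, and gallery candidates are ranked by a Siamese-style cosine similarity to the resulting descriptor. Names, shapes, the normalisation, and the scoring rule are illustrative, not the paper's exact network.

```python
# Graph convolution over a camera topology followed by a GRU and Siamese-style ranking.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopologyGRUEncoder(nn.Module):
    def __init__(self, feat_dim=256, hidden=256):
        super().__init__()
        self.graph_proj = nn.Linear(feat_dim, hidden)    # shared projection after graph mixing
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, cam_feats, adjacency):
        # cam_feats: (N, K, feat_dim) features of one pedestrian seen by K cameras in time order;
        # adjacency: (K, K) logical topology of those cameras.
        A = adjacency / adjacency.sum(dim=1, keepdim=True).clamp(min=1e-6)   # row-normalise
        mixed = torch.einsum("kj,njd->nkd", A, cam_feats)                    # spatial message passing
        h = F.relu(self.graph_proj(mixed))
        _, last = self.gru(h)                            # temporal aggregation across cameras
        return F.normalize(last.squeeze(0), dim=-1)      # (N, hidden) trajectory descriptor

def siamese_rank(query_desc, gallery_descs):
    """Rank gallery descriptors by cosine similarity to the query descriptor."""
    scores = gallery_descs @ query_desc                  # higher score = more similar
    return torch.argsort(scores, descending=True)

# Usage with hypothetical shapes: one query trajectory over 4 cameras, 10 gallery entries.
enc = TopologyGRUEncoder()
q = enc(torch.randn(1, 4, 256), torch.rand(4, 4))        # (1, 256)
g = F.normalize(torch.randn(10, 256), dim=-1)
print(siamese_rank(q.squeeze(0), g))
```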
{"title":"Person re-identification via deep compound eye network and pose repair module","authors":"Hongjian Gu, Wenxuan Zou, Keyang Cheng, Bin Wu, Humaira Abdul Ghafoor, Yongzhao Zhan","doi":"10.1049/cvi2.12282","DOIUrl":"10.1049/cvi2.12282","url":null,"abstract":"<p>Person re-identification is aimed at searching for specific target pedestrians from non-intersecting cameras. However, in real complex scenes, pedestrians are easily obscured, which makes the target pedestrian search task time-consuming and challenging. To address the problem of pedestrians' susceptibility to occlusion, a person re-identification via deep compound eye network (CEN) and pose repair module is proposed, which includes (1) A deep CEN based on multi-camera logical topology is proposed, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out pedestrian global matching through the Siamese network; (2) An integrated spatial-temporal information aggregation network is designed to facilitate pose repair. The target pedestrian features under the multi-level logic topology camera are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; (3) A joint optimisation mechanism of CEN and pose repair network is introduced, where multi-camera logical topology inference provides auxiliary information and retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the authors’ method achieved significant performance across these datasets. Specifically, on the CUHK-SYSU dataset, the authors’ model achieved a top-1 accuracy of 89.1% and a mean Average Precision accuracy of 83.1% in the recognition of occluded individuals.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 6","pages":"826-841"},"PeriodicalIF":1.5,"publicationDate":"2024-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12282","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140741587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a local three-scale encoding (LTSE) structure. In particular, the authors introduce an LTSE structure with two-level attention cascades. This design is tailored to enhance the efficient capture of details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCL) and residual operations, designing the recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named the separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain the three-scale decoding features from the decoder and warp them into a set of three-scale pre-warped maps, which are fused in the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.
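A minimal PyTorch sketch, under assumptions, of the two lightweight building blocks named in the abstract: a recurrent residual convolutional unit (the same convolution applied recurrently with a residual connection) and a "separable" variant that swaps the dense 3x3 convolution for a depthwise-separable one to cut parameters. The recurrence depth, channel sizes, and class names are illustrative, not the paper's exact units.

```python
# Recurrent residual convolutional unit and its depthwise-separable, lighter variant.
import torch
import torch.nn as nn

class RecurrentResidualConvUnit(nn.Module):
    def __init__(self, channels, steps=2):
        super().__init__()
        self.steps = steps
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        h = self.conv(x)
        for _ in range(self.steps - 1):
            h = self.conv(x + h)        # recurrent refinement with the same convolution
        return x + h                    # residual connection

class SeparableRecurrentResidualConvUnit(RecurrentResidualConvUnit):
    def __init__(self, channels, steps=2):
        super().__init__(channels, steps)
        # Depthwise + pointwise convolution replaces the dense 3x3 to reduce parameters.
        self.conv = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
                                  nn.Conv2d(channels, channels, 1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

def count_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, 64, 32, 32)
dense, sep = RecurrentResidualConvUnit(64), SeparableRecurrentResidualConvUnit(64)
print(dense(x).shape, sep(x).shape)                  # both torch.Size([1, 64, 32, 32])
print(count_params(dense), ">", count_params(sep))   # the separable unit has far fewer parameters
```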
{"title":"Video frame interpolation via spatial multi-scale modelling","authors":"Zhe Qu, Weijing Liu, Lizhen Cui, Xiaohui Yang","doi":"10.1049/cvi2.12281","DOIUrl":"10.1049/cvi2.12281","url":null,"abstract":"<p>Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a Local three-scale encoding (LTSE) structure. In particular, the authors introduce a LTSE structure with two-level attention cascades. This design is tailored to enhance the efficient capture of details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCL) and residual operations, designing the recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain the three-scale decoding features from the decoder and warp them for a set of three-scale pre-warped maps. The authors fuse them into the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"458-472"},"PeriodicalIF":1.7,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12281","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140746884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}