
IET Computer Vision: Latest Publications

Guest Editorial: Advanced image restoration and enhancement in the wild
IF 1.7 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-19 | DOI: 10.1049/cvi2.12283
Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo
Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.

For this Special Issue, we received 17 papers, of which 8 underwent the peer review process, while the rest were desk-rejected. Among the reviewed papers, 5 were accepted and 3 were rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.

The five accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category aims at reconstructing high-quality videos and comprises the papers by Zhang et al., Gu et al., and Xu et al. The second category studies the task of image super-resolution and comprises the papers by Dou et al. and Yang et al. A brief presentation of each paper in this special issue follows.

Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world datasets show that their method achieves state-of-the-art accuracy with high efficiency.

To exploit temporal cues between adjacent frames during video reconstruction, most previous methods perform alignment between the initial reconstructions. However, the estimated motion is usually too coarse to provide accurate temporal information. To address this issue, the proposed network adopts stacked temporal-shift reconstruction blocks to progressively enhance the initial reconstructions. Within each block, efficient temporal-shift operations are used to capture temporal structures while keeping the computational overhead low. A bidirectional alignment module is then employed to capture temporal dependencies across the video sequence. Unlike previous methods that extract complementary information only from key frames, the proposed alignment module receives temporal information from the whole video sequence through bidirectional propagation.

Qu et al. propose a lightweight video frame interpolation network with a three-scale encoding-decoding structure. Specifically, multi-scale motion information is first extracted from the input video. Recurrent convolutional layers are then adopted to refine the resulting features, which are finally aggregated to generate high-quality interpolated frames.

Experimental results on the CelebA and Helen datasets show that the proposed method outperforms state-of-the-art methods while using fewer parameters. Most previous methods adopt a multi-task learning paradigm that performs landmark detection while super-resolving low-resolution face images. However, these methods incur additional annotation cost, and the extracted facial prior structures are usually of limited quality.
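
As a point of reference for the temporal-shift operations mentioned above, here is a minimal, self-contained sketch of a TSM-style channel shift along the time axis. The tensor layout, shift fraction, and function name are illustrative assumptions, not the implementation used in any of the accepted papers.

```python
# A minimal sketch of a temporal-shift operation: a fraction of channels is shifted
# one step forward/backward in time so each frame sees information from its neighbours.
import torch


def temporal_shift(x: torch.Tensor, shift_frac: float = 0.125) -> torch.Tensor:
    """Shift a fraction of channels along the time axis.

    x: feature tensor of shape (batch, time, channels, height, width).
    """
    n, t, c, h, w = x.shape
    fold = int(c * shift_frac)
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                    # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]    # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # remaining channels untouched
    return out


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 32, 32)   # (N, T, C, H, W)
    print(temporal_shift(feats).shape)      # torch.Size([2, 8, 64, 32, 32])
```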
{"title":"Guest Editorial: Advanced image restoration and enhancement in the wild","authors":"Longguang Wang,&nbsp;Juncheng Li,&nbsp;Naoto Yokoya,&nbsp;Radu Timofte,&nbsp;Yulan Guo","doi":"10.1049/cvi2.12283","DOIUrl":"https://doi.org/10.1049/cvi2.12283","url":null,"abstract":"&lt;p&gt;Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.&lt;/p&gt;&lt;p&gt;In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.&lt;/p&gt;&lt;p&gt;The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each of the paper in this special issue is as follows.&lt;/p&gt;&lt;p&gt;Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. 
Experiments on both synthetic and real-world dataset","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 4","pages":"435-438"},"PeriodicalIF":1.7,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12283","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141246088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Temporal channel reconfiguration multi-graph convolution network for skeleton-based action recognition
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-17 | DOI: 10.1049/cvi2.12279
Siyue Lei, Bin Tang, Yanhua Chen, Mingfu Zhao, Yifei Xu, Zourong Long

Skeleton-based action recognition has received much attention and achieved remarkable results in the field of human action recognition. In time series action prediction for different scales, existing methods mainly focus on attention mechanisms to enhance modelling capabilities in spatial dimensions. However, this approach strongly depends on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, the authors designed a module called Temporal Channel Fusion with Guidance (TCFG) to capture important temporal information within channels at different scales and avoid ignoring cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn the topological graph feature for different length time series. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU-RGB + D60 and 120, as well as UAV-Human, demonstrate that TRMGCN exhibits advanced performance and capabilities. Furthermore, experiments on the smaller dataset NW-UCLA have indicated that the authors’ model possesses strong generalisation abilities.
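
The TCFG and TD-MIG modules build on spatial graph convolution over the skeleton's joint graph. Below is a minimal sketch of such a graph-convolution layer with a symmetrically normalised adjacency matrix; the joint count, adjacency contents, and projection are illustrative assumptions rather than the TRMGCN design.

```python
# A minimal sketch of a skeleton graph-convolution layer: joint features are aggregated
# over a normalised adjacency matrix and then projected to a new channel dimension.
import torch
import torch.nn as nn


class SkeletonGraphConv(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalised adjacency with self-loops: D^-1/2 (A + I) D^-1/2
        a_hat = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        x = torch.einsum("nctv,vw->nctw", x, self.norm_adj)  # aggregate over joints
        return self.proj(x)


if __name__ == "__main__":
    num_joints = 25                                 # e.g. NTU-RGB+D skeletons
    adj = torch.zeros(num_joints, num_joints)       # fill with the skeleton's bone links in practice
    layer = SkeletonGraphConv(3, 64, adj)
    print(layer(torch.randn(4, 3, 32, num_joints)).shape)   # torch.Size([4, 64, 32, 25])
```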

Citations: 0
Instance segmentation by blend U-Net and VOLO network
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-09 | DOI: 10.1049/cvi2.12275
Hongfei Deng, Bin Wen, Rui Wang, Zuwei Feng

It remains challenging for instance segmentation to correctly distinguish different instances among overlapping, dense and numerous target objects. To address this, the authors simplify the instance segmentation problem to an instance classification problem and propose a novel end-to-end trained instance segmentation algorithm, CotuNet. Firstly, the algorithm combines convolutional neural networks (CNN), Outlooker and Transformer to design a new hybrid Encoder (COT) for further feature extraction. It consists of extracting low-level features of the image using CNN, which are passed through the Outlooker to extract more refined local data representations. Then global contextual information is generated by aggregating the data representations in local space using Transformer. Finally, the combination of cascaded upsampling and skip connection modules is used as the Decoder (C-UP) to enable the blend of multiple different scales of high-resolution information to generate accurate masks. By validating on the CVPPP 2017 dataset and comparing with previous state-of-the-art methods, CotuNet shows superior competitiveness and segmentation performance.
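
The C-UP decoder described above combines cascaded upsampling with skip connections. Below is a minimal sketch of one such decoder block (upsample, concatenate the skip feature, convolve); channel sizes and the exact fusion are assumptions rather than the CotuNet implementation.

```python
# A minimal sketch of an upsampling decoder block with a skip connection, the generic
# building block behind cascaded-upsampling decoders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpBlock(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse feature map to the skip resolution, then fuse both.
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))


if __name__ == "__main__":
    coarse = torch.randn(1, 256, 16, 16)   # low-resolution encoder output
    skip = torch.randn(1, 128, 32, 32)     # higher-resolution encoder feature
    up = UpBlock(256, 128, 128)
    print(up(coarse, skip).shape)          # torch.Size([1, 128, 32, 32])
```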

Citations: 0
Person re-identification via deep compound eye network and pose repair module
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-04 | DOI: 10.1049/cvi2.12282
Hongjian Gu, Wenxuan Zou, Keyang Cheng, Bin Wu, Humaira Abdul Ghafoor, Yongzhao Zhan

Person re-identification is aimed at searching for specific target pedestrians from non-intersecting cameras. However, in real complex scenes, pedestrians are easily obscured, which makes the target pedestrian search task time-consuming and challenging. To address the problem of pedestrians' susceptibility to occlusion, a person re-identification method via deep compound eye network (CEN) and pose repair module is proposed, which includes (1) A deep CEN based on multi-camera logical topology is proposed, which adopts graph convolution and a Gated Recurrent Unit to capture the temporal and spatial information of pedestrian walking and finally carries out pedestrian global matching through the Siamese network; (2) An integrated spatial-temporal information aggregation network is designed to facilitate pose repair. The target pedestrian features under the multi-level logic topology camera are utilised as auxiliary information to repair the occluded target pedestrian image, so as to reduce the impact of pedestrian mismatch due to pose changes; (3) A joint optimisation mechanism of CEN and pose repair network is introduced, where multi-camera logical topology inference provides auxiliary information and retrieval order for the pose repair network. The authors conducted experiments on multiple datasets, including Occluded-DukeMTMC, CUHK-SYSU, PRW, SLP, and UJS-reID. The results indicate that the authors’ method achieved significant performance across these datasets. Specifically, on the CUHK-SYSU dataset, the authors’ model achieved a top-1 accuracy of 89.1% and a mean Average Precision of 83.1% in the recognition of occluded individuals.
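
The abstract combines a Gated Recurrent Unit over pedestrian walking sequences with Siamese-style global matching. The sketch below illustrates that general pattern, a shared GRU encoder whose track embeddings are compared by cosine similarity; the dimensions and scoring are assumptions, not the authors' CEN.

```python
# A minimal sketch of Siamese matching over temporal pedestrian features: a shared GRU
# encodes each feature sequence, and the resulting embeddings are compared.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiameseGRUMatcher(nn.Module):
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def embed(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, feat_dim); the final hidden state serves as the track embedding.
        _, h_n = self.encoder(seq)
        return F.normalize(h_n[-1], dim=-1)

    def forward(self, seq_a: torch.Tensor, seq_b: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity score per pair in the batch.
        return (self.embed(seq_a) * self.embed(seq_b)).sum(dim=-1)


if __name__ == "__main__":
    matcher = SiameseGRUMatcher()
    score = matcher(torch.randn(4, 10, 256), torch.randn(4, 10, 256))
    print(score.shape)   # torch.Size([4])
```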

Citations: 0
Video frame interpolation via spatial multi-scale modelling
IF 1.7 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-03 | DOI: 10.1049/cvi2.12281
Zhe Qu, Weijing Liu, Lizhen Cui, Xiaohui Yang

Video frame interpolation (VFI) is a technique that synthesises intermediate frames between adjacent original video frames to enhance the temporal super-resolution of the video. However, existing methods usually rely on heavy model architectures with a large number of parameters. The authors introduce an efficient VFI network based on multiple lightweight convolutional units and a Local three-scale encoding (LTSE) structure. In particular, the authors introduce an LTSE structure with two-level attention cascades. This design is tailored to enhance the efficient capture of details and contextual information across diverse scales in images. Secondly, the authors introduce recurrent convolutional layers (RCL) and residual operations, designing the recurrent residual convolutional unit to optimise the LTSE structure. Additionally, a lightweight convolutional unit named separable recurrent residual convolutional unit is introduced to reduce the model parameters. Finally, the authors obtain the three-scale decoding features from the decoder and warp them into a set of three-scale pre-warped maps. The authors fuse them into the synthesis network to generate high-quality interpolated frames. The experimental results indicate that the proposed approach achieves superior performance with fewer model parameters.
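
For context on the "separable recurrent residual convolutional unit" named above, here is a minimal sketch of the generic idea: a depthwise-separable convolution whose weights are reused over a few recurrent steps with a residual path. The step count and channel widths are assumptions, not the paper's exact unit.

```python
# A minimal sketch of a separable recurrent residual convolutional unit: one depthwise
# + pointwise convolution is applied repeatedly with shared weights and a residual path.
import torch
import torch.nn as nn


class SeparableRecurrentResidualUnit(nn.Module):
    def __init__(self, channels: int, steps: int = 2):
        super().__init__()
        self.steps = steps
        # Depthwise + pointwise factorisation keeps the parameter count low.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x
        for _ in range(self.steps):
            # The same separable convolution is reapplied, always adding back the input.
            out = self.act(x + self.pointwise(self.depthwise(out)))
        return out


if __name__ == "__main__":
    unit = SeparableRecurrentResidualUnit(32)
    print(unit(torch.randn(1, 32, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```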

Citations: 0
Continuous-dilated temporal and inter-frame motion excitation feature learning for gait recognition
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-01 | DOI: 10.1049/cvi2.12278
Chunsheng Hua, Hao Zhang, Jia Li, Yingjie Pan

The authors present global-interval and local-continuous feature extraction networks for gait recognition. Unlike conventional gait recognition methods focussing on the full gait cycle, the authors introduce a novel global-continuous-dilated temporal feature extraction (TFE) to extract continuous and interval motion features from the silhouette frames globally. Simultaneously, an inter-frame motion excitation (IME) module is proposed to enhance the unique motion expression of an individual, which remains unchanged regardless of clothing variations. The spatio-temporal features extracted from the TFE and IME modules are then weighted and concatenated by an adaptive aggregator network for recognition. Extensive experiments on the CASIA-B and mini-OUMVLP datasets demonstrate that the proposed method achieves performance comparable to other state-of-the-art approaches (98%, 95%, and 84.9% in the normal-walking, bag- or backpack-carrying, and coat- or jacket-wearing categories of CASIA-B, and 89% on mini-OUMVLP).
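
The continuous-dilated temporal feature extraction described above can be illustrated with a minimal sketch: parallel 1-D convolutions with different dilation rates over a pooled silhouette feature sequence, fused by a 1x1 convolution. The branch count and dilation rates are assumptions, not the authors' TFE module.

```python
# A minimal sketch of dilated temporal feature extraction: parallel dilated 1-D
# convolutions capture motion at several temporal ranges and are fused channel-wise.
import torch
import torch.nn as nn


class DilatedTemporalBlock(nn.Module):
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames) pooled silhouette features.
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(feats)


if __name__ == "__main__":
    block = DilatedTemporalBlock(64)
    print(block(torch.randn(2, 64, 30)).shape)   # torch.Size([2, 64, 30])
```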

Citations: 0
Pruning-guided feature distillation for an efficient transformer-based pose estimation model
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-03-31 | DOI: 10.1049/cvi2.12277
Dong-hwi Kim, Dong-hun Lee, Aro Kim, Jinwoo Jeong, Jong Taek Lee, Sungjei Kim, Sang-hyo Park

The authors propose a compression strategy for a 3D human pose estimation model based on a transformer which yields high accuracy but increases the model size. This approach involves a pruning-guided determination of the search range to achieve lightweight pose estimation under limited training time and to identify the optimal model size. In addition, the authors propose a transformer-based feature distillation (TFD) method, which efficiently exploits the pose estimation model in terms of both model size and accuracy by leveraging transformer architecture characteristics. Pruning-guided TFD is the first approach for 3D human pose estimation that employs transformer architecture. The proposed approach was tested on various extensive data sets, and the results show that it can reduce the model size by 30% compared to the state-of-the-art while ensuring high accuracy.
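
As a rough illustration of transformer feature distillation of the kind the abstract describes, the sketch below regresses a student's intermediate token features onto a frozen teacher's through a linear adapter; the adapter and the plain MSE objective are generic assumptions, not the paper's TFD method.

```python
# A minimal sketch of feature distillation between transformer models: student token
# features are projected to the teacher width and regressed onto the teacher features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDistillationLoss(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear adapter so the (smaller) student width matches the teacher width.
        self.adapter = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_tokens: torch.Tensor, teacher_tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, sequence, dim); the teacher is frozen, so its features are detached.
        return F.mse_loss(self.adapter(student_tokens), teacher_tokens.detach())


if __name__ == "__main__":
    distill = FeatureDistillationLoss(student_dim=192, teacher_dim=384)
    loss = distill(torch.randn(2, 17, 192), torch.randn(2, 17, 384))
    print(loss.item())
```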

Citations: 0
Prompt guidance query with cascaded constraint decoders for human–object interaction detection
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-03-29 | DOI: 10.1049/cvi2.12276
Sheng Liu, Bingnan Guo, Feng Zhang, Junhao Chen, Ruixiang Chen

Human–object interaction (HOI) detection, which localises and recognises interactions between human and object, requires high-level image and scene understanding. Recent methods for HOI detection typically utilise transformer-based architecture to build unified feature representations. However, these methods use random initial queries to predict interactive human–object pairs, leading to a lack of prior knowledge. Furthermore, most methods provide unified features to forecast interactions using conventional decoder structures, but they lack the ability to build efficient multi-task representations. To address these problems, we propose a novel two-stage HOI detector called PGCD, mainly consisting of prompt guidance query and cascaded constraint decoders. Firstly, the authors propose a novel prompt guidance query generation module (PGQ) to introduce the guidance-semantic features. In PGQ, the authors build visual-semantic transfer to obtain fuller semantic representations. In addition, a cascaded constraint decoder architecture (CD) with random masks is designed to build fine-grained interaction features and improve the model's generalisation performance. Experimental results demonstrate that the authors’ proposed approach obtains significant performance on the two widely used benchmarks, that is, HICO-DET and V-COCO.
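
To make the query-guidance idea concrete, here is a minimal sketch in which decoder queries are seeded with guidance embeddings instead of being purely random before attending to image features; the dimensions, query count, and the way guidance is injected are assumptions, not the PGQ/CD design.

```python
# A minimal sketch of query-based interaction decoding with prior-informed queries:
# learnt queries plus guidance embeddings attend to image tokens in a decoder layer.
import torch
import torch.nn as nn


class GuidedQueryDecoder(nn.Module):
    def __init__(self, dim: int = 256, num_queries: int = 64):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, dim)
        self.layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, image_feats: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, tokens, dim); guidance: (batch, num_queries, dim)
        queries = self.query_embed.weight.unsqueeze(0) + guidance  # seed queries with guidance
        return self.layer(tgt=queries, memory=image_feats)


if __name__ == "__main__":
    dec = GuidedQueryDecoder()
    out = dec(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
    print(out.shape)   # torch.Size([2, 64, 256])
```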

Citations: 0
Joint image restoration for object detection in snowy weather
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-03-27 | DOI: 10.1049/cvi2.12274
Jing Wang, Meimei Xu, Huazhu Xue, Zhanqiang Huo, Fen Luo

Although existing object detectors achieve encouraging detection and localisation performance under ideal conditions, their performance in adverse weather such as snow is very poor and insufficient for detection tasks under such conditions. Existing methods do not deal well with the effect of snow on the identity of object features, and usually ignore or even discard potential information that can help improve detection performance. To this end, the authors propose a novel and improved end-to-end object detection network with joint image restoration. Specifically, in order to address the problem of identity degradation of object detection due to snow, an ingenious restoration-detection dual-branch network structure combined with a Multi-Integrated Attention module is proposed, which can well mitigate the effect of snow on the identity of object features, thus improving the detection performance of the detector. In order to make more effective use of the features that are beneficial to the detection task, a Self-Adaptive Feature Fusion module is introduced, which helps the network better learn the potential features that are beneficial to detection and, through a special feature fusion, eliminates the effect of heavy or large local snow in the object area on detection, thus improving the network's detection capability in snowy weather. In addition, the authors construct a large-scale, multi-size snowy dataset called Synthetic and Real Snowy Dataset (SRSD), which is a good and necessary complement and improvement to existing snow-related tasks. Extensive experiments on a public snowy dataset (Snowy-weather Datasets) and SRSD indicate that the method outperforms existing state-of-the-art object detectors.
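
The Self-Adaptive Feature Fusion module motivates the following minimal sketch of gated fusion between detection-branch and restoration-branch features; the sigmoid gate is a generic choice and not the paper's module.

```python
# A minimal sketch of adaptive fusion between two feature branches: a learnt gate decides,
# per pixel and channel, how much restored detail to inject into the detection features.
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, det_feat: torch.Tensor, restore_feat: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([det_feat, restore_feat], dim=1))
        return g * restore_feat + (1.0 - g) * det_feat


if __name__ == "__main__":
    fuse = AdaptiveFusion(128)
    print(fuse(torch.randn(1, 128, 40, 40), torch.randn(1, 128, 40, 40)).shape)
```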

Citations: 0
Tag-inferring and tag-guided Transformer for image captioning
IF 1.5 | CAS Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-03-22 | DOI: 10.1049/cvi2.12280
Yaohua Yi, Yinkai Liang, Dezhu Kong, Ziwei Tang, Jibing Peng

Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods ignore the problem that simple semantic tags have difficulty expressing the detailed semantics for different image contents. Therefore, the authors propose a tag-inferring and tag-guided Transformer for image captioning to generate fine-grained captions. First, a tag-inferring encoder is proposed, which uses the tags extracted by the scene graph model to infer tags with deeper semantic information. Then, with the obtained deep tag information, a tag-guided decoder that includes short-term attention to improve the features of words in the sentence and gated cross-modal attention to combine image features, tag features and language features to produce informative semantic features is proposed. Finally, the word probability distribution of all positions in the sequence is calculated to generate descriptions for the image. The experiments demonstrate that the authors’ method can combine tags to obtain precise captions and that it achieves competitive performance with a 40.6% BLEU-4 score and 135.3% CIDEr score on the MSCOCO data set.
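
The gated cross-modal attention mentioned above can be sketched as follows: word features attend separately to image-region and tag features, and a learnt gate mixes the two attended results. Dimensions and the gating form are illustrative assumptions, not the paper's decoder.

```python
# A minimal sketch of gated cross-modal attention for caption decoding: two cross-attention
# paths (words-to-image and words-to-tags) are blended by a sigmoid gate.
import torch
import torch.nn as nn


class GatedCrossModalAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_tag = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, words: torch.Tensor, img: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
        # words: (batch, len, dim); img: (batch, regions, dim); tags: (batch, num_tags, dim)
        a_img, _ = self.attn_img(words, img, img)
        a_tag, _ = self.attn_tag(words, tags, tags)
        g = self.gate(torch.cat([a_img, a_tag], dim=-1))
        return g * a_img + (1.0 - g) * a_tag


if __name__ == "__main__":
    m = GatedCrossModalAttention()
    out = m(torch.randn(2, 12, 512), torch.randn(2, 36, 512), torch.randn(2, 5, 512))
    print(out.shape)   # torch.Size([2, 12, 512])
```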

Citations: 0