Zero-shot temporal event localisation: Label-free, training-free, domain-free
Li Sun, Ping Wang, Liuan Wang, Jun Sun, Takayuki Okatani
IET Computer Vision, 17(5), 599–613. Published 3 August 2023. DOI: 10.1049/cvi2.12224

Temporal event localisation (TEL) has recently attracted increasing attention due to the rapid development of video platforms. Existing methods are based on fully/weakly supervised or unsupervised learning, and thus rely on expensive data annotation and time-consuming training. Moreover, because these models are trained on data from a specific domain, they generalise poorly under data distribution shifts. To cope with these difficulties, the authors propose a zero-shot TEL method that operates without training data or annotations. Leveraging large-scale vision-and-language pre-trained models such as CLIP, the authors address two key problems: (1) how to find the relevant region where the event is likely to occur; and (2) how to determine the event duration once that region is found. A query-guided optimisation for local frame relevance, based on the query-to-frame relationship, is proposed to find the frame region where the event is most likely to occur, and a proposal generation method based on the frame-to-frame relationship is proposed to determine the event duration. A greedy event sampling strategy is also proposed to predict multiple high-reliability durations for a given event. The methodology is label-free, training-free and domain-free, enabling TEL to be applied purely at the testing stage. Experimental results show competitive performance on the standard Charades-STA and ActivityCaptions datasets.
Improving object detection by enhancing the effect of localisation quality evaluation on detection confidence
Zuyi Wang, Wei Zhao, Li Xu
IET Computer Vision, 18(1), 97–109. Published 31 July 2023. DOI: 10.1049/cvi2.12227

The one-stage object detector has been widely applied in many computer vision applications due to its high detection efficiency and simple framework. However, one-stage detectors rely heavily on non-maximum suppression to remove duplicate predictions for the same object, and they produce a detection confidence to measure the quality of those predictions. Localisation quality is an important factor in evaluating predicted bounding boxes, but its role has not been fully exploited in previous works. To alleviate this problem, the authors design the Quality Prediction Block (QPB), a lightweight sub-network that strengthens the effect of localisation quality evaluation on detection confidence by leveraging the features of the predicted bounding boxes. The QPB is simple in structure and applies to different forms of detection confidence. Extensive experiments are conducted on the public benchmarks MS COCO, PASCAL VOC and Berkeley DeepDrive. The results demonstrate the effectiveness of the method in detectors with various forms of detection confidence, and the proposed approach also achieves better performance in stronger one-stage detectors.
Lite-weight semantic segmentation with AG self-attention
Bing Liu, Yansheng Gao, Hai Li, Zhaohao Zhong, Hongwei Zhao
IET Computer Vision, 18(1), 72–83. Published 28 July 2023. DOI: 10.1049/cvi2.12225

Due to the large computational and GPU memory cost of semantic segmentation, some works focus on designing lightweight models that achieve a good trade-off between computational cost and accuracy. A common approach is to combine a CNN with a vision transformer. However, these methods ignore the contextual information of multiple receptive fields, and they often fail to recover the detailed information lost when multi-scale features are downsampled. To fix these issues, we propose AG Self-Attention, which consists of Enhanced Atrous Self-Attention (EASA) and Gate Attention. AG Self-Attention adds the contextual information of multiple receptive fields to the global semantic feature. Specifically, EASA uses weight-shared atrous convolutions with different atrous rates to capture contextual information under different receptive fields, while Gate Attention introduces a gating mechanism that injects detailed information into the global semantic feature and filters it by producing a “fusion” gate and an “update” gate. To validate this design, we conduct extensive experiments on common semantic segmentation datasets, namely ADE20K, COCO-Stuff, PASCAL Context and Cityscapes, showing that our method achieves state-of-the-art performance and a good trade-off between computational cost and accuracy.
LiteCCLKNet: A lightweight criss-cross large kernel convolutional neural network for hyperspectral image classification
Chengcheng Zhong, Na Gong, Zitong Zhang, Yanan Jiang, Kai Zhang
IET Computer Vision, 17(7), 763–776. Published 24 July 2023. DOI: 10.1049/cvi2.12218

High-performance convolutional neural networks (CNNs) stack many convolutional layers to obtain powerful feature extraction capability, which leads to huge storage and computational costs. The authors focus on lightweight models for hyperspectral image (HSI) classification and propose a novel lightweight criss-cross large kernel convolutional neural network (LiteCCLKNet). Specifically, a lightweight module containing two 1D convolutions with self-attention mechanisms in orthogonal directions is presented. By setting large kernels within the 1D convolutional layers, the proposed module can efficiently aggregate long-range contextual features. In addition, a global receptive field is obtained by stacking only two of the proposed modules. Compared with traditional lightweight CNNs, LiteCCLKNet reduces the number of parameters for easy deployment to resource-limited platforms. Experimental results on three HSI datasets demonstrate that LiteCCLKNet outperforms previous lightweight CNNs and has higher storage efficiency.
Dynamic facial expression recognition with pseudo-label guided multi-modal pre-training
Bing Yin, Shi Yin, Cong Liu, Yanyong Zhang, Changfeng Xi, Baocai Yin, Zhenhua Ling
IET Computer Vision, 18(1), 33–45. Published 21 July 2023. DOI: 10.1049/cvi2.12217

Due to the huge cost of manual annotation, labelled data may not be sufficient to train a dynamic facial expression (DFR) recogniser with good performance. To address this, the authors propose a multi-modal pre-training method with a pseudo-label guidance mechanism that makes full use of unlabelled video data to learn informative representations of facial expressions. First, the authors build a pre-training dataset of videos with aligned vision and audio modalities. Second, the vision and audio feature encoders are trained through an instance discrimination strategy and a cross-modal alignment strategy on the pre-training data. Third, the vision feature encoder is extended into a dynamic expression recogniser and fine-tuned on the labelled training data. Fourth, the fine-tuned recogniser is used to predict pseudo-labels for the pre-training data, and a new pre-training phase is started under the guidance of these pseudo-labels to alleviate the long-tail distribution problem and the instance-class conflict. Fifth, since the representations learnt under pseudo-label guidance are more informative, a further fine-tuning phase is added to boost generalisation on the DFR recognition task. Experimental results on the Dynamic Facial Expression in the Wild dataset demonstrate the superiority of the proposed method.
Position-aware spatio-temporal graph convolutional networks for skeleton-based action recognition
Ping Yang, Qin Wang, Hao Chen, Zizhao Wu
IET Computer Vision, 17(7), 844–854. Published 13 July 2023. DOI: 10.1049/cvi2.12223

Graph Convolutional Networks (GCNs) have been widely used in skeleton-based action recognition. Although significant performance has been achieved, it is still challenging to effectively model the complex dynamics of skeleton sequences. A novel position-aware spatio-temporal GCN for skeleton-based action recognition is proposed, in which positional encoding is investigated to enhance the capacity of typical baselines to comprehend the dynamic characteristics of an action sequence. Specifically, the authors’ method systematically investigates temporal position encoding and spatial position embedding to explicitly capture the sequence-ordering information and the identity information of the nodes used in the graphs. Additionally, to alleviate the redundancy and over-smoothing problems of typical GCNs, the method further investigates a subgraph mask, which mines the prominent subgraph patterns over the underlying graph and makes the model robust to the impact of irrelevant joints. Extensive experiments on three large-scale datasets demonstrate that the model achieves competitive results compared with previous state-of-the-art methods.
A point-image fusion network for event-based frame interpolation
Chushu Zhang, Wei An, Ye Zhang, Miao Li
IET Computer Vision, 18(4), 439–447. Published 10 July 2023. DOI: 10.1049/cvi2.12220

Temporal information in event streams plays a critical role in event-based video frame interpolation, as it provides temporal context cues complementary to images. Most previous event-based methods first transform the unstructured event data into structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, voxelisation inevitably leads to information loss, and processing the sparse voxels introduces severe computational redundancy. To address these limitations, this study proposes a point-image fusion network (PIFNet). In PIFNet, rich temporal information from the events can be directly extracted at the point level. A fusion module is then designed to fuse complementary cues from both points and images for frame interpolation. Extensive experiments on both synthetic and real datasets demonstrate that PIFNet achieves state-of-the-art performance with high efficiency.
Enhancing human parsing with region-level learning
Yanghong Zhou, P. Y. Mok
IET Computer Vision, 18(1), 60–71. Published 5 July 2023. DOI: 10.1049/cvi2.12222

Human parsing is very important in a diverse range of industrial applications. Despite the considerable progress that has been achieved, the performance of existing methods is still less than satisfactory, since these methods learn the shared features of various parsing labels at the image level. This limits the representativeness of the learnt features, especially when the distribution of parsing labels is imbalanced or the scale of different labels differs substantially. To address this limitation, a Region-level Parsing Refiner (RPR) is proposed to enhance parsing performance by introducing region-level parsing learning. Region-level parsing focuses specifically on small regions of the body, for example, the head. The proposed RPR is an adaptive module that can be integrated with different existing human parsing models to improve their performance. Extensive experiments are conducted on two benchmark datasets, and the results demonstrate the effectiveness of the RPR model in improving overall parsing performance as well as the parsing of rare labels. The method has been successfully applied in a commercial application for the extraction of human body measurements and is used in various online shopping platforms for clothing size recommendations. The code and dataset are released at https://github.com/applezhouyp/PRP.
CAGAN: Classifier-augmented generative adversarial networks for weakly-supervised COVID-19 lung lesion localisation
Xiaojie Li, Xin Fei, Zhe Yan, Hongping Ren, Canghong Shi, Xian Zhang, Imran Mumtaz, Yong Luo, Xi Wu
IET Computer Vision, 18(1), 1–14. Published 3 July 2023. DOI: 10.1049/cvi2.12216

The Coronavirus Disease 2019 (COVID-19) epidemic has constituted a Public Health Emergency of International Concern. Chest computed tomography (CT) can reveal abnormalities indicative of lung disease at an early stage, so accurate and automatic localisation of lung lesions is particularly important to assist physicians in the rapid diagnosis of COVID-19 patients. The authors propose a classifier-augmented generative adversarial network framework for weakly supervised COVID-19 lung lesion localisation. It consists of an abnormality map generator, a discriminator and a classifier. The generator produces an abnormality feature map M to locate lesion regions and then constructs images of pseudo-healthy subjects by adding M to the input patient images. Besides using the discriminator to constrain the generated healthy-subject images towards the real distribution, a pre-trained classifier is introduced to encourage the generated images to share high-level semantic feature representations with those of real healthy people. Moreover, an attention gate is employed in the generator to reduce the noise effect in the irrelevant regions of M. Experimental results on the COVID-19 CT dataset show that the method captures more lesion areas while generating less noise in unrelated areas, and that it has significant advantages over existing methods in terms of quantitative and qualitative results.
Mirror complementary transformer network for RGB-thermal salient object detection
Xiurong Jiang, Yifan Hou, Hui Tian, Lin Zhu
IET Computer Vision, 18(1), 15–32. Published 28 June 2023. DOI: 10.1049/cvi2.12221

Conventional RGB-T salient object detection (SOD) treats the RGB and thermal modalities equally to locate the common salient regions. However, the authors observed that the rich colour and texture information of the RGB modality makes objects more prominent against the background, whereas the thermal modality records the temperature differences of the scene, so objects usually contain clear and continuous edge information. In this work, a novel mirror-complementary Transformer network (MCNet) is proposed for RGB-T SOD, which supervises the two modalities separately with a complementary set of saliency labels under a symmetrical structure. Moreover, attention-based feature interaction and serial multiscale dilated convolution (SDC)-based feature fusion modules are introduced to let the two modalities complement and adjust each other flexibly. When one modality fails, the proposed model can still accurately segment the salient regions. To demonstrate the robustness of the proposed model in challenging real-world scenes, the authors build a novel RGB-T SOD dataset, VT723, based on a large public semantic segmentation RGB-T dataset used in the autonomous driving domain. Extensive experiments on benchmark datasets and VT723 show that the proposed method outperforms state-of-the-art approaches, including CNN-based and Transformer-based methods. The code and dataset can be found at https://github.com/jxr326/SwinMCNet.