
IET Computer Vision: Latest Publications

SPANet: Spatial perceptual activation network for camouflaged object detection
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-18 | DOI: 10.1049/cvi2.12310
Jianhao Zhang, Gang Yang, Xun Dai, Pengyu Yang

Camouflaged object detection (COD) aims to segment objects embedded in the environment from the background. Most existing methods are easily affected by background interference in cluttered environments and cannot accurately locate camouflaged areas, resulting in over-segmentation or incomplete segmentation structures. To effectively improve the performance of COD, the authors propose a spatial perceptual activation network (SPANet). SPANet extracts the spatial positional relationships between the objects in the scene by activating spatial perception and uses them as global information to guide segmentation. It mainly consists of three modules: a perceptual activation module (PAM), a feature inference module (FIM), and an interaction recovery module (IRM). Specifically, the authors design the PAM to model the positional relationship between the camouflaged object and the surrounding environment and obtain semantic correlation information. The FIM then combines this correlation information to suppress background interference and re-encodes it to generate multi-scale features. In addition, to further fuse the multi-scale features, the IRM mines the complementary information and differences between features at different scales. Extensive experimental results on four widely used benchmark datasets (i.e. CAMO, CHAMELEON, COD10K, and NC4K) show that the authors' method outperforms 13 state-of-the-art methods.
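The abstract does not describe the internals of the PAM. Purely as an illustration of how position-to-position spatial correlations over a feature map can be modelled, a generic non-local affinity block is sketched below; this is not the authors' module, and all layer names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAffinity(nn.Module):
    """Generic non-local spatial affinity over a feature map.

    Computes a position-to-position correlation matrix and uses it to
    aggregate surrounding context for every location; shown only to
    illustrate spatial-relation modelling, not the authors' PAM.
    """
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)     # (B, HW, C//8)
        k = self.key(x).flatten(2)                       # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)              # (B, HW, HW) affinity
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x                                   # residual context

# Illustrative usage on a random backbone feature map.
feat = torch.randn(2, 64, 22, 22)
ctx = SpatialAffinity(64)(feat)
```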

{"title":"SPANet: Spatial perceptual activation network for camouflaged object detection","authors":"Jianhao Zhang,&nbsp;Gang Yang,&nbsp;Xun Dai,&nbsp;Pengyu Yang","doi":"10.1049/cvi2.12310","DOIUrl":"https://doi.org/10.1049/cvi2.12310","url":null,"abstract":"<p>Camouflaged object detection (COD) aims to segment objects embedded in the environment from the background. Most existing methods are easily affected by background interference in cluttered environments and cannot accurately locate camouflage areas, resulting in over-segmentation or incomplete segmentation structures. To effectively improve the performance of COD, we propose a spatial perceptual activation network (SPANet). SPANet extracts the spatial positional relationship between each object in the scene by activating spatial perception and uses it as global information to guide segmentation. It mainly consists of three modules: perceptual activation module (PAM), feature inference module (FIM), and interaction recovery module (IRM). Specifically, the authors design a PAM to model the positional relationship between the camouflaged object and the surrounding environment to obtain semantic correlation information. Then, a FIM that can effectively combine correlation information to suppress background interference and re-encode to generate multi-scale features is proposed. In addition, to further fuse multi-scale features, an IRM to mine the complementary information and differences between features at different scales is designed. Extensive experimental results on four widely used benchmark datasets (i.e. CAMO, CHAMELEON, COD10K, and NC4K) show that the authors’ method outperforms 13 state-of-the-art methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1300-1312"},"PeriodicalIF":1.5,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12310","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SRL-ProtoNet: Self-supervised representation learning for few-shot remote sensing scene classification
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-02 | DOI: 10.1049/cvi2.12304
Bing Liu, Hongwei Zhao, Jiao Li, Yansheng Gao, Jianrong Zhang

Using a deep learning method to classify a large amount of labelled remote sensing scene data produces good performance. However, it is challenging for deep learning based methods to generalise to classification tasks with limited data. Few-shot learning allows neural networks to classify unseen categories when confronted with a handful of labelled samples. Currently, episodic tasks based on meta-learning can effectively complete few-shot classification, and training an encoder that can conduct representation learning has become an important component of few-shot learning. An end-to-end few-shot remote sensing scene classification model based on ProtoNet and self-supervised learning is proposed. The authors design the Pre-prototype for a more discrete feature space and better integration with self-supervised learning, and also propose the ProtoMixer for higher-quality prototypes with a global receptive field. The authors' method outperforms the existing state-of-the-art self-supervised methods on three widely used benchmark datasets: UC-Merced, NWPU-RESISC45, and AID. Compared with the previous state-of-the-art performance, the method improves one-shot accuracy by 1.21%, 2.36%, and 0.84% on AID, UC-Merced, and NWPU-RESISC45, respectively, and five-shot accuracy by 0.85%, 2.79%, and 0.74% on the same datasets.
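SRL-ProtoNet builds on ProtoNet, whose episodic classification step is standard: class prototypes are the mean embeddings of the support samples, and queries are scored by their distance to each prototype. A minimal sketch of that step follows; the encoder, Pre-prototype, and ProtoMixer are not reproduced, and all shapes and names are illustrative.

```python
import torch

def prototype_logits(support, support_labels, query, n_classes):
    """Prototypical-network classification step (Snell et al., 2017).

    support:        (n_support, d) embeddings of labelled support samples
    support_labels: (n_support,)   integer class ids in [0, n_classes)
    query:          (n_query, d)   embeddings of query samples
    Returns (n_query, n_classes) logits as negative squared Euclidean
    distances to each class prototype.
    """
    prototypes = torch.stack(
        [support[support_labels == c].mean(dim=0) for c in range(n_classes)]
    )                                            # (n_classes, d)
    dists = torch.cdist(query, prototypes) ** 2  # (n_query, n_classes)
    return -dists                                # softmax over negative distance

# Illustrative 5-way 1-shot episode with random embeddings.
d = 128
support = torch.randn(5, d)
labels = torch.arange(5)
query = torch.randn(15, d)
pred = prototype_logits(support, labels, query, n_classes=5).argmax(dim=1)
```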

Citations: 0
Improving unsupervised pedestrian re-identification with enhanced feature representation and robust clustering
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-26 | DOI: 10.1049/cvi2.12309
Jiang Luo, Lingjun Liu

Pedestrian re-identification (re-ID) is an important research direction in computer vision, with extensive applications in pattern recognition and surveillance systems. Owing to uneven data distributions and unresolved problems in clustering criteria and similarity evaluation, the performance of unsupervised methods is limited. To address these issues, an improved unsupervised re-ID method, called Enhanced Feature Representation and Robust Clustering (EFRRC), is proposed. First, a relation network that considers the relations between each part of the pedestrian's body and the other parts is introduced, thereby obtaining more discriminative feature representations. The network makes the feature at the single-part level also contain partial information from other body parts, making it more discriminative. A global contrastive pooling (GCP) module is introduced to obtain the global features of the image. Second, a dispersion-based clustering method, which can effectively evaluate clustering quality and discover potential patterns in the data, is designed. This approach considers a wider context of sample-level pairwise relationships for robust cluster affinity assessment and effectively addresses the challenges posed by imbalanced data distributions in complex situations. The above structures are connected through a clustering contrastive learning framework, which not only improves the discriminative power of features and the accuracy of clustering, but also solves the problem of inconsistent clustering updates. Experimental results on three public datasets demonstrate the superiority of the method over existing unsupervised re-ID methods.
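The clustering contrastive learning framework is not detailed in the abstract. As an illustration of the general family such frameworks belong to, a cluster-level InfoNCE loss computed against a centroid memory bank is sketched below; this is an assumption about the loss form, not the authors' exact objective, and all tensor names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(features, pseudo_labels, centroids, temperature=0.05):
    """InfoNCE-style loss pulling each sample towards its cluster centroid.

    features:      (B, d)  L2-normalised embeddings of a mini-batch
    pseudo_labels: (B,)    cluster id assigned to each sample by clustering
    centroids:     (K, d)  L2-normalised cluster centroids (memory bank)
    """
    logits = features @ centroids.t() / temperature   # (B, K) cosine similarities
    return F.cross_entropy(logits, pseudo_labels)

# Illustrative usage with random tensors.
B, d, K = 32, 256, 100
feats = F.normalize(torch.randn(B, d), dim=1)
labels = torch.randint(0, K, (B,))
bank = F.normalize(torch.randn(K, d), dim=1)
loss = cluster_contrastive_loss(feats, labels, bank)
```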

{"title":"Improving unsupervised pedestrian re-identification with enhanced feature representation and robust clustering","authors":"Jiang Luo,&nbsp;Lingjun Liu","doi":"10.1049/cvi2.12309","DOIUrl":"https://doi.org/10.1049/cvi2.12309","url":null,"abstract":"<p>Pedestrian re-identification (re-ID) is an important research direction in computer vision, with extensive applications in pattern recognition and monitoring systems. Due to uneven data distribution, and the need to solve clustering standards and similarity evaluation problems, the performance of unsupervised methods is limited. To address these issues, an improved unsupervised re-ID method, called Enhanced Feature Representation and Robust Clustering (EFRRC), which combines EFRRC is proposed. First, a relation network that considers the relations between each part of the pedestrian's body and other parts is introduced, thereby obtaining more discriminative feature representations. The network makes the feature at the single-part level also contain partial information of other body parts, making it more discriminative. A global contrastive pooling (GCP) module is introduced to obtain the global features of the image. Second, a dispersion-based clustering method, which can effectively evaluate the quality of clustering and discover potential patterns in the data is designed. This approach considers a wider context of sample-level pairwise relationships for robust cluster affinity assessment. It effectively addresses challenges posed by imbalanced data distributions in complex situations. The above structures are connected through a clustering contrastive learning framework, which not only improves the discriminative power of features and the accuracy of clustering, but also solves the problem of inconsistent clustering updates. Experimental results on three public datasets demonstrate the superiority of our method over existing unsupervised re-ID methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1097-1111"},"PeriodicalIF":1.5,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12309","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing semi-supervised contrastive learning through saliency map for diabetic retinopathy grading
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-26 | DOI: 10.1049/cvi2.12308
Jiacheng Zhang, Rong Jin, Wenqiang Liu

Diabetic retinopathy (DR) is a severe ophthalmic condition that can lead to blindness if not diagnosed and treated in a timely manner. Hence, the development of efficient automated DR grading systems is crucial for early screening and treatment. Although progress has been made in DR detection using deep learning techniques, these methods still face challenges in handling the complexity of DR lesion characteristics and the nuances of the grading criteria. Moreover, the performance of these algorithms is hampered by the scarcity of large-scale, high-quality annotated data. An innovative semi-supervised fundus image DR grading framework is proposed, employing a saliency estimation map to bolster the model's perception of fundus structures, thereby improving the differentiation between lesions and healthy regions. By integrating semi-supervised and contrastive learning, the model's ability to recognise inter-class and intra-class variations in DR grading is enhanced, allowing for precise discrimination of various lesion features. Experiments conducted on publicly available DR grading datasets, such as EyePACS and Messidor, have validated the effectiveness of the proposed method. Specifically, the approach outperforms the state of the art on the kappa metric by 0.8% on the full EyePACS dataset and by 3.2% on a 10% subset of EyePACS, demonstrating its superiority over previous methodologies. The authors' code is publicly available at https://github.com/500ZhangJC/SCL-SEM-framework-for-DR-Grading.
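The gains are reported on the kappa metric; for DR grading on EyePACS this is conventionally the quadratic weighted kappa. The sketch below shows how that agreement score is computed; treating the reported kappa as the quadratic weighted variant is an assumption, and the grade vectors are purely illustrative.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Quadratic weighted kappa over integer grades in [0, n_classes)."""
    o = np.zeros((n_classes, n_classes))              # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        o[t, p] += 1
    # Quadratic disagreement weights, 0 on the diagonal, 1 at the extremes.
    w = np.array([[(i - j) ** 2 for j in range(n_classes)]
                  for i in range(n_classes)]) / (n_classes - 1) ** 2
    # Expected agreement under chance, scaled to the same total count.
    e = np.outer(o.sum(axis=1), o.sum(axis=0)) / o.sum()
    return 1.0 - (w * o).sum() / (w * e).sum()

# Illustrative usage on toy DR grades (0 = no DR ... 4 = proliferative DR).
y_true = np.array([0, 1, 2, 3, 4, 2, 1])
y_pred = np.array([0, 1, 2, 2, 4, 2, 0])
print(quadratic_weighted_kappa(y_true, y_pred))
```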

{"title":"Enhancing semi-supervised contrastive learning through saliency map for diabetic retinopathy grading","authors":"Jiacheng Zhang,&nbsp;Rong Jin,&nbsp;Wenqiang Liu","doi":"10.1049/cvi2.12308","DOIUrl":"https://doi.org/10.1049/cvi2.12308","url":null,"abstract":"<p>Diabetic retinopathy (DR) is a severe ophthalmic condition that can lead to blindness if not diagnosed and provided timely treatment. Hence, the development of efficient automated DR grading systems is crucial for early screening and treatment. Although progress has been made in DR detection using deep learning techniques, these methods still face challenges in handling the complexity of DR lesion characteristics and the nuances in grading criteria. Moreover, the performance of these algorithms is hampered by the scarcity of large-scale, high-quality annotated data. An innovative semi-supervised fundus image DR grading framework is proposed, employing a saliency estimation map to bolster the model's perception of fundus structures, thereby improving the differentiation between lesions and healthy regions. By integrating semi-supervised and contrastive learning, the model's ability to recognise inter-class and intra-class variations in DR grading is enhanced, allowing for precise discrimination of various lesion features. Experiments conducted on publicly available DR grading datasets, such as EyePACS and Messidor, have validated the effectiveness of our proposed method. Specifically, our approach outperforms the state of the art on the kappa metric by 0.8% on the full EyePACS dataset and by 3.2% on a 10% subset of EyePACS, demonstrating its superiority over previous methodologies. The authors’ code is publicly available at https://github.com/500ZhangJC/SCL-SEM-framework-for-DR-Grading.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"18 8","pages":"1127-1137"},"PeriodicalIF":1.5,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12308","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143253267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Balanced parametric body prior for implicit clothed human reconstruction from a monocular RGB
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-25 | DOI: 10.1049/cvi2.12306
Rong Xue, Jiefeng Li, Cewu Lu

The authors study the problem of reconstructing detailed 3D human surfaces in various poses and clothing from images. The parametric human body allows accurate 3D clothed human reconstruction. However, the offset of large and loose clothing from the inferred parametric body mesh confines the generalisation of the existing parametric body-based methods. A distinctive method that simultaneously generalises well to unseen poses and unseen clothing is proposed. The authors first discover the unbalanced nature of existing implicit function-based methods. To address this issue, the authors propose to synthesise balanced training samples with a new dependency coefficient in training. The dependency coefficient tells the network whether the prior from the parametric body model is reliable. The authors then design a novel positional embedding-based attenuation strategy to incorporate the dependency coefficient into the implicit function (IF) network. Comprehensive experiments are conducted on the CAPE dataset to study the effectiveness of the authors' approach. The proposed method significantly surpasses state-of-the-art approaches and generalises well to unseen poses and clothing. As an illustrative example, the proposed method reduces the Chamfer Distance Error and Normal Error by 38.2% and 57.6%, respectively.
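Chamfer Distance Error is one of the reported evaluation metrics. A minimal sketch of the symmetric Chamfer distance between a reconstructed and a ground-truth point set follows; the point counts and tensors are illustrative, and this is the standard metric rather than code from the paper.

```python
import torch

def chamfer_distance(p1, p2):
    """Symmetric Chamfer distance between two point sets.

    p1: (N, 3) points sampled from the reconstructed surface
    p2: (M, 3) points sampled from the ground-truth surface
    Returns the mean nearest-neighbour squared distance in both directions.
    """
    d = torch.cdist(p1, p2) ** 2                 # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

# Illustrative usage on random point clouds.
recon = torch.rand(2048, 3)
gt = torch.rand(2048, 3)
print(chamfer_distance(recon, gt))
```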

Citations: 0
Social-ATPGNN: Prediction of multi-modal pedestrian trajectory of non-homogeneous social interaction
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-21 | DOI: 10.1049/cvi2.12286
Kehao Wang, Han Zou

With the development of automatic driving and path planning technology, predicting the moving trajectories of pedestrians in dynamic scenes has become a key and urgent technical problem. However, most existing techniques regard all pedestrians in the scene as having an equally important influence on the predicted pedestrian's trajectory, and existing methods that use sequence-based time-series generative models to obtain the predicted trajectories do not allow parallel computation, which introduces a significant computational overhead. A new social trajectory prediction network, Social-ATPGNN, which builds on ATPGNN and integrates both temporal and spatial information, is proposed. In the spatial domain, the pedestrians in the predicted scene are formed into an undirected, non-fully-connected graph, which solves the problem of homogenisation of pedestrian relationships; the spatial interactions between pedestrians are then encoded to improve the accuracy of modelling pedestrian social consciousness. After acquiring high-level spatial data, the method uses a Temporal Convolutional Network, which can perform parallel calculations, to capture the correlations in the time series of pedestrian trajectories. Through a large number of experiments, the proposed model shows its superiority over the latest models on various pedestrian trajectory datasets.
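The parallelism claim rests on the Temporal Convolutional Network: unlike a recurrent model, a TCN processes all observed time steps of a trajectory in one convolution. A minimal sketch of a single dilated causal convolution block follows; it is an illustration of the TCN idea under assumed shapes and channel sizes, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    """One dilated causal 1D convolution block, as used in TCNs.

    All time steps are processed in parallel; left-only padding keeps the
    convolution causal (no future positions leak into the present).
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                          # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))    # pad on the left only
        return self.relu(self.conv(x))

# Illustrative usage: 8 observed (x, y) positions per pedestrian.
traj = torch.randn(16, 2, 8)                       # (batch, coords, time)
out = CausalConvBlock(in_ch=2, out_ch=32)(traj)    # (16, 32, 8)
```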

Citations: 0
HIST: Hierarchical and sequential transformer for image captioning
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-15 | DOI: 10.1049/cvi2.12305
Feixiao Lv, Rui Wang, Lihua Jing, Pengwen Dai

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder–decoder transformer framework. Such transformer structures, however, show two main limitations in the task of image captioning. Firstly, the traditional transformer obtains high-level fusion features to decode while ignoring other-level features, resulting in losses of image content. Secondly, the transformer is weak in modelling the natural order characteristics of language. To address these issues, the authors propose a HIerarchical and Sequential Transformer (HIST) structure, which forces each layer of the encoder and decoder to focus on features of different granularities and strengthens the sequential semantic information. Specifically, to capture the details of different levels of features in the image, the authors combine the visual features of multiple regions and divide them into multiple levels. In addition, to enhance the sequential information, the sequential enhancement module in each decoder layer block extracts different levels of features for sequential semantic extraction and expression. Extensive experiments on the public MS-COCO and Flickr30k datasets demonstrate the effectiveness of the proposed method and show that it outperforms most previous state-of-the-art methods.
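The abstract does not specify how encoder feature levels are paired with decoder layers. The sketch below is only a guess at such a level-wise pairing, where decoder layer i cross-attends to the i-th granularity level; the sequential enhancement module is omitted, and every class name, dimension, and level count is an assumption for illustration.

```python
import torch
import torch.nn as nn

class LevelwiseDecoder(nn.Module):
    """Caption decoder in which layer i cross-attends only to feature level i.

    A speculative reading of the 'hierarchical' pairing described in the
    abstract, not the authors' HIST implementation.
    """
    def __init__(self, d_model=512, n_levels=3, n_heads=8):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_levels)
        ])

    def forward(self, tgt, level_feats):       # level_feats: list of (B, N_i, d)
        for layer, memory in zip(self.layers, level_feats):
            tgt = layer(tgt, memory)           # cross-attend to one level per layer
        return tgt

# Illustrative usage: three granularity levels of region features.
feats = [torch.randn(2, n, 512) for n in (36, 18, 9)]
words = torch.randn(2, 12, 512)                # embedded partial caption
out = LevelwiseDecoder()(words, feats)
```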

Citations: 0
Multi-modal video search by examples—A video quality impact analysis
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-27 | DOI: 10.1049/cvi2.12303
Guanfeng Wu, Abbas Haider, Xing Tian, Erfan Loweimi, Chi Ho Chan, Mengjie Qian, Awan Muhammad, Ivor Spence, Rob Cooper, Wing W. Y. Ng, Josef Kittler, Mark Gales, Hui Wang

As video content continues to proliferate and many video archives lack suitable metadata, video retrieval, particularly through example-based search, has become increasingly crucial. Existing metadata often fails to meet the needs of specific types of searches, especially when videos contain elements from different modalities, such as visual and audio. Consequently, developing video retrieval methods that can handle multi-modal content is essential. An innovative Multi-modal Video Search by Examples (MVSE) framework is introduced, employing state-of-the-art techniques in its various components. In designing MVSE, the authors focused on accuracy, efficiency, interactivity, and extensibility, with key components including advanced data processing and a user-friendly interface aimed at enhancing search effectiveness and user experience. Furthermore, the framework was comprehensively evaluated, assessing individual components, data quality issues, and overall retrieval performance using high-quality and low-quality BBC archive videos. The evaluation reveals that: (1) multi-modal search yields better results than single-modal search; (2) the quality of the video, both visual and audio, affects query precision, with audio quality having a greater impact than image quality; (3) a two-stage search process (i.e. searching by Hamming distance based on hashing, followed by searching by cosine similarity based on embeddings) is effective but increases time overhead; (4) large-scale video retrieval is not only feasible but also expected to emerge shortly.
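The two-stage search named in finding (3), a cheap Hamming-distance scan over binary hashes followed by cosine re-ranking of a shortlist with dense embeddings, can be illustrated with a short sketch. The array names, shortlist size, and bit layout below are assumptions, not the MVSE implementation.

```python
import numpy as np

def two_stage_search(query_code, query_emb, codes, embs, shortlist=100, top_k=10):
    """Two-stage retrieval: coarse Hamming shortlist, then cosine re-ranking.

    query_code: (b,)   binary hash bits of the query (0/1)
    codes:      (N, b) binary hash bits of the archive items
    query_emb:  (d,)   dense embedding of the query (L2-normalised)
    embs:       (N, d) dense embeddings of the archive (L2-normalised)
    """
    # Stage 1: Hamming distance on binary hashes (cheap, scans everything).
    hamming = (codes != query_code).sum(axis=1)
    candidates = np.argsort(hamming)[:shortlist]

    # Stage 2: cosine similarity on the shortlist only (accurate, costlier).
    sims = embs[candidates] @ query_emb
    order = np.argsort(-sims)[:top_k]
    return candidates[order]

# Illustrative usage with random data.
N, b, d = 10000, 64, 256
codes = np.random.randint(0, 2, size=(N, b), dtype=np.uint8)
embs = np.random.randn(N, d); embs /= np.linalg.norm(embs, axis=1, keepdims=True)
print(two_stage_search(codes[0], embs[0], codes, embs))
```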

Citations: 0
2D human skeleton action recognition with spatial constraints
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-11 | DOI: 10.1049/cvi2.12296
Lei Wang, Jianwei Zhang, Wenbing Yang, Song Gu, Shanmin Yang

Human actions are predominantly presented in 2D format in video surveillance scenarios, which hinders the accurate determination of action details that are not apparent in 2D data. Depth estimation can aid human action recognition tasks, enhancing accuracy when combined with neural networks. However, relying on images for depth estimation requires extensive computational resources and cannot utilise the connectivity between human body structures. Moreover, the estimated depth may not accurately reflect actual depth ranges, so its reliability needs to be improved. Therefore, a 2D human skeleton action recognition method with spatial constraints (2D-SCHAR) is introduced. 2D-SCHAR employs graph convolution networks to process graph-structured human action skeleton data and comprises three parts: depth estimation, spatial transformation, and action recognition. The first two components infer 3D information from 2D human skeleton actions and generate spatial transformation parameters to correct abnormal deviations in the action data, supporting the recognition component and enhancing the accuracy of action recognition. The model is designed in an end-to-end, multitasking manner, allowing parameter sharing among these three components to boost performance. The experimental results validate the model's effectiveness and superiority in human skeleton action recognition.
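The abstract names graph convolution over the skeleton graph. As a hedged illustration, not the authors' exact layer, a single graph-convolution layer over the joint adjacency matrix might look as follows; the joint count, bone list, and feature sizes are assumptions.

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """One graph-convolution layer over 2D skeleton joints.

    adj is the (V, V) joint adjacency matrix of the skeleton graph; a
    symmetrically normalised A + I propagates features between
    physically connected joints.
    """
    def __init__(self, in_feats, out_feats, adj):
        super().__init__()
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, x):              # x: (B, V, in_feats), V = number of joints
        return torch.relu(self.linear(self.a_norm @ x))

# Illustrative usage: 17 joints with (x, y) coordinates and a few example bones.
V = 17
adj = torch.zeros(V, V)
for i, j in [(0, 1), (1, 2), (2, 3)]:  # illustrative joint connections only
    adj[i, j] = adj[j, i] = 1
out = SkeletonGCNLayer(2, 64, adj)(torch.randn(8, V, 2))   # (8, 17, 64)
```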

Citations: 0
Centre-loss—A preferred class verification approach over sample-to-sample in self-checkout products datasets
IF 1.5 | CAS Tier 4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-11 | DOI: 10.1049/cvi2.12302
Bernardas Ciapas, Povilas Treigys

Siamese networks excel at comparing two images, serving as an effective class verification technique when there is a single reference image per class. However, when multiple reference images are present, Siamese verification necessitates multiple comparisons and aggregation, which is often impractical at inference. The Centre-Loss approach proposed in this research solves a class verification task more efficiently than sample-to-sample approaches, using a single forward pass during inference. Optimising a Centre-Loss function learns class centres and minimises intra-class distances in latent space. The authors compared verification accuracy using Centre-Loss against aggregated Siamese verification when the other hyperparameters (such as the neural network backbone and distance type) are the same. Experiments contrasted the ubiquitous Euclidean distance against other distance types to discover the optimum Centre-Loss layer, its size, and the Centre-Loss weight. In the optimal architecture, the Centre-Loss layer is connected to the penultimate layer, calculates Euclidean distance, and its size depends on the distance type. The Centre-Loss method was validated on the Self-Checkout products and Fruits 360 image datasets. Its comparable accuracy and lower complexity make Centre-Loss a preferred approach over sample-to-sample comparison for the class verification task when the number of reference images per class is high and inference speed matters, such as in self-checkouts.
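Centre loss itself is a published formulation (Wen et al., 2016): each embedding is pulled towards a learnable centre of its class, minimising intra-class distance, and at inference a sample can be verified against the nearest learnt centre in a single forward pass. A minimal sketch of the loss follows; tensor shapes and class counts are chosen for illustration and the paper's exact layer placement and weighting are not reproduced.

```python
import torch
import torch.nn as nn

class CentreLoss(nn.Module):
    """Centre loss: mean squared distance from each embedding to its class centre."""
    def __init__(self, n_classes, feat_dim):
        super().__init__()
        # One learnable centre per class, trained jointly with the network.
        self.centres = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, features, labels):      # features: (B, d), labels: (B,)
        return ((features - self.centres[labels]) ** 2).sum(dim=1).mean() / 2

# Illustrative usage; at inference, class verification reduces to finding
# the nearest learnt centre instead of comparing to every reference image.
feats = torch.randn(32, 128)
labels = torch.randint(0, 100, (32,))
loss = CentreLoss(n_classes=100, feat_dim=128)(feats, labels)
```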

Citations: 0