
Latest Publications in IET Computer Vision

LRCM: Enhancing Adversarial Purification Through Latent Representation Compression
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-05-19 | DOI: 10.1049/cvi2.70030
Yixin Li, Xintao Luo, Weijie Wu, Minjia Zheng

In the current context of the extensive use of deep neural networks, it has been observed that neural network models are vulnerable to adversarial perturbations, which may lead to unexpected results. In this paper, we introduce an Adversarial Purification Model rooted in latent representation compression, aimed at enhancing the robustness of deep learning models. Initially, we employ an encoder-decoder architecture inspired by U-Net to extract features from input samples. Subsequently, these features undergo a process of information compression to remove adversarial perturbations from the latent space. To counteract the model's tendency to overly focus on fine-grained details of input samples, resulting in ineffective adversarial sample purification, an early freezing mechanism is introduced during the encoder training process. We tested our model's ability to purify adversarial samples generated from the CIFAR-10, CIFAR-100, and ImageNet datasets using various methods. These samples were then used to test ResNet, an image recognition classifier. Our experiments covered different resolutions and attack types to fully assess LRCM's effectiveness against adversarial attacks. We also compared LRCM with other defence strategies, demonstrating its strong defensive capabilities.
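The abstract does not include code, but the core idea of purifying inputs by squeezing their latent representation can be pictured in a few lines of PyTorch. The sketch below is illustrative only: the layer sizes, the `LatentPurifier` and `freeze_encoder` names, and the plain convolutional autoencoder (rather than the paper's U-Net-style backbone) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LatentPurifier(nn.Module):
    """Minimal encoder-bottleneck-decoder sketch of latent-compression purification.

    The real LRCM uses a U-Net-style backbone; this toy version only illustrates
    squeezing the latent code to strip adversarial perturbations.
    """
    def __init__(self, in_ch: int = 3, latent_ch: int = 64, compress_ch: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, latent_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(latent_ch, latent_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Information compression: project the latent to far fewer channels.
        self.compress = nn.Conv2d(latent_ch, compress_ch, 1)
        self.expand = nn.Conv2d(compress_ch, latent_ch, 1)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, latent_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(latent_ch, in_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.compress(self.encoder(x))
        return self.decoder(self.expand(z))

def freeze_encoder(model: LatentPurifier) -> None:
    """Early-freezing idea: stop updating the encoder after a few epochs so the
    model does not overfit to fine-grained input detail."""
    for p in model.encoder.parameters():
        p.requires_grad = False
```

Training such a purifier would reconstruct clean images from perturbed inputs, with the encoder frozen after the first few epochs, in the spirit of the mechanism the abstract describes.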

Citations: 0
Geometric Edge Modelling in Self-Supervised Learning for Enhanced Indoor Depth Estimation
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-05-12 | DOI: 10.1049/cvi2.70026
Niclas Joswig, Laura Ruotsalainen

Recently, the accuracy of self-supervised deep learning models for indoor depth estimation has approached that of supervised models by improving the supervision in planar regions. However, a common issue with integrating multiple planar priors is the generation of oversmooth depth maps, leading to unrealistic and erroneous depth representations at edges. Despite the fact that edge pixels only cover a small part of the image, they are of high significance for downstream tasks such as visual odometry, where image features, essential for motion computation, are mostly located at edges. To improve erroneous depth predictions at edge regions, we delve into the self-supervised training process, identifying its limitations and using these insights to develop a geometric edge model. Building on this, we introduce a novel algorithm that utilises the smooth depth predictions of existing models and colour image data to accurately identify edge pixels. After finding the edge pixels, our approach generates targeted self-supervision in these zones by interpolating depth values from adjacent planar areas towards the edges. We integrate the proposed algorithms into a novel loss function that encourages neural networks to predict sharper and more accurate depth edges in indoor scenes. To validate our methodology, we incorporated the proposed edge-enhancing loss function into a state-of-the-art self-supervised depth estimation framework. Our results demonstrate a notable improvement in the accuracy of edge depth predictions and a 19% improvement in visual odometry when using our depth model to generate RGB-D input, compared to the baseline model.
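A rough sketch of what an edge-targeted depth loss can look like is given below. It is not the authors' loss: instead of interpolating depth from adjacent planar regions, this simplified version merely flags likely edge pixels from colour gradients and up-weights an L1 term there; `thresh`, `edge_weight` and the `target_depth` tensor (standing in for the generated self-supervision) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def colour_edge_mask(image: torch.Tensor, thresh: float = 0.1) -> torch.Tensor:
    """Flag likely depth-edge pixels from colour gradients: (B, 3, H, W) -> (B, 1, H, W)."""
    grey = image.mean(dim=1, keepdim=True)
    gx = grey[..., :, 1:] - grey[..., :, :-1]
    gy = grey[..., 1:, :] - grey[..., :-1, :]
    gx = F.pad(gx, (0, 1, 0, 0))          # pad width back to W
    gy = F.pad(gy, (0, 0, 0, 1))          # pad height back to H
    mag = torch.sqrt(gx ** 2 + gy ** 2)
    return (mag > thresh).float()

def edge_enhancing_loss(pred_depth, target_depth, image, edge_weight: float = 5.0):
    """L1 depth loss with extra weight at colour-edge pixels, pushing the
    network toward sharper depth discontinuities there."""
    mask = colour_edge_mask(image)
    l1 = (pred_depth - target_depth).abs()
    return (l1 * (1.0 + edge_weight * mask)).mean()
```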

Citations: 0
Adapting the Re-ID Challenge for Static Sensors
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-05-08 | DOI: 10.1049/cvi2.70027
Avirath Sundaresan, Jason Parham, Jonathan Crall, Rosemary Warungu, Timothy Muthami, Jackson Miliko, Margaret Mwangi, Jason Holmberg, Tanya Berger-Wolf, Daniel Rubenstein, Charles Stewart, Sara Beery

The Grévy's zebra, an endangered species native to Kenya and southern Ethiopia, has been the target of sustained conservation efforts in recent years. Accurately monitoring Grévy's zebra populations is essential for ecologists to evaluate ongoing conservation initiatives. Recently, in both 2016 and 2018, a full census of the Grévy's zebra population was enabled by the Great Grévy's Rally (GGR), a citizen science event that combines teams of volunteers to capture data with computer vision algorithms that help experts estimate the number of individuals in the population. A complementary, scalable, cost-effective and long-term Grévy's population monitoring approach involves deploying a network of camera traps, which we have done at the Mpala Research Centre in Laikipia County, Kenya. In both scenarios, a substantial majority of the images of zebras are not usable for individual identification due to ‘in-the-wild’ imaging conditions: occlusions from vegetation or other animals, oblique views, low image quality and animals that appear in the far background and are thus too small to identify. Camera trap images, without an intelligent human photographer to select the framing and focus on the animals of interest, are of even poorer quality, with high rates of occlusion and high spatiotemporal similarity within image bursts. We employ an image filtering pipeline incorporating animal detection, species identification, viewpoint estimation, quality evaluation and temporal subsampling to compensate for these factors and obtain individual crops from camera trap and GGR images of suitable quality for re-ID. We then employ the local clusterings and their alternatives (LCA) algorithm, a hybrid computer vision and graph clustering method for animal re-ID, on the resulting high-quality crops. Our method processed images taken during GGR-16 and GGR-18 in Meru County, Kenya, into 4142 highly comparable annotations, requiring only 120 contrastive same-vs-different-individual decisions from a human reviewer to produce a population estimate of 349 individuals (within 4.6% of the ground truth count in Meru County). Our method also efficiently processed 8.9M unlabelled camera trap images from 70 camera traps at Mpala over 2 years into 685 encounters of 173 unique individuals, requiring only 331 contrastive decisions from a human reviewer.
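The filtering pipeline described above (detection, species, viewpoint, quality, temporal subsampling) can be pictured as a chain of simple predicates. The sketch below is a schematic stand-in, not the authors' code; the `Detection` fields, the thresholds and the 'left'/'right' viewpoint labels are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Detection:
    image_path: str
    species: str
    viewpoint: str      # e.g. 'left' or 'right'
    quality: float      # 0..1 score from a quality model
    timestamp: float    # seconds, used for temporal subsampling

def filter_pipeline(dets: Iterable[Detection],
                    target_species: str = "zebra_grevys",
                    min_quality: float = 0.5,
                    min_gap_s: float = 10.0) -> List[Detection]:
    """Staged filtering: keep the target species, usable viewpoints and quality,
    then temporally subsample near-duplicate bursts."""
    kept, last_ts = [], -1e9
    for d in sorted(dets, key=lambda d: d.timestamp):
        if d.species != target_species:
            continue
        if d.viewpoint not in ("left", "right"):
            continue
        if d.quality < min_quality:
            continue
        if d.timestamp - last_ts < min_gap_s:
            continue          # drop near-duplicate frames from the same burst
        kept.append(d)
        last_ts = d.timestamp
    return kept
```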

Citations: 0
Texture-Aware Network for Enhancing Inner Smoke Representation in Visual Smoke Density Estimation
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-05-06 | DOI: 10.1049/cvi2.70023
Xue Xia, Yajing Peng, Zichen Li, Jinting Shi, Yuming Fang

Smoke often appears before visible flames in the early stages of fire disasters, making accurate pixel-wise detection essential for fire alarms. Although existing segmentation models effectively identify smoke pixels, they generally treat all pixels within a smoke region as having the same prior probability. This assumption of rigidity, common in natural object segmentation, fails to account for the inherent variability within smoke. We argue that pixels within smoke exhibit a probabilistic relationship with both smoke and background, necessitating density estimation to enhance the representation of internal structures within the smoke. To this end, we propose enhancements across the entire network. First, we improve the backbone by adaptively integrating scene information into texture features through separate paths, enabling smoke-tailored feature representation for further exploitation. Second, we introduce a texture-aware head with long convolutional kernels to integrate both global and orientation-specific information, enhancing representation of intricate smoke structures. Third, we develop a dual-task decoder for simultaneous density and location recovery, with frequency-domain alignment in the final stage to preserve internal smoke details. Extensive experiments on synthetic and real smoke datasets demonstrate the effectiveness of our approach. Specifically, comparisons with 17 models show the superiority of our method, with mean IoU improvements of 4.88%, 2.63%, and 3.17% on three test sets. (The code will be available on https://github.com/xia-xx-cv/TANet_smoke).
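'Long convolutional kernels' for orientation-specific context are commonly realised as strip convolutions. The PyTorch sketch below illustrates that general pattern only; the channel count, kernel length `k` and the residual fusion are assumptions and do not reproduce the paper's texture-aware head.

```python
import torch
import torch.nn as nn

class LongKernelTextureHead(nn.Module):
    """Orientation-aware head built from strip (1 x k and k x 1) convolutions,
    a common way to mix global and directional context with long kernels."""
    def __init__(self, channels: int = 64, k: int = 11):
        super().__init__()
        pad = k // 2
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, pad))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(pad, 0))
        self.fuse = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        h = self.horizontal(feat)                              # row-wise context
        v = self.vertical(feat)                                # column-wise context
        return self.fuse(torch.cat([h, v], dim=1)) + feat      # residual fusion
```

A dual-task decoder could then attach separate 1x1 heads to this enriched feature for density and location recovery.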

Citations: 0
Angle Metric Learning for Discriminative Features on Vehicle Re-Identification
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-05-04 | DOI: 10.1049/cvi2.70015
Yutong Xie, Shuoqi Zhang, Lide Guo, Yuming Liu, Rukai Wei, Yanzhao Xie, Yangtao Wang, Maobin Tang, Lisheng Fan

Vehicle re-identification (Re-ID) facilitates the recognition and distinction of vehicles based on their visual characteristics in images or videos. However, accurately identifying a vehicle poses great challenges due to (i) the pronounced intra-instance variations encountered under varying lighting conditions such as day and night and (ii) the subtle inter-instance differences observed among similar vehicles. To address these challenges, the authors propose Angle Metric learning for Discriminative Features on vehicle Re-ID (termed AMDF), which aims to maximise the variance between visual features of different classes while minimising the variance within the same class. AMDF comprehensively measures the angle and distance discrepancies between features. First, to mitigate the impact of lighting conditions on intra-class variation, the authors employ CycleGAN to generate images that simulate consistent lighting (either day or night), thereby standardising the conditions for distance measurement. Second, a Swin Transformer is integrated to help generate more detailed features. Finally, a novel angle metric loss based on cosine distance is proposed, which integrates the angular metric and the 2-norm metric, effectively maximising the decision boundary in angular space. Extensive experimental evaluations on three public datasets, including VERI-776, VERI-Wild, and VEHICLEID, indicate that the method achieves state-of-the-art performance. The code of this project is released at https://github.com/ZnCu-0906/AMDF.
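As a hedged illustration of combining an angular (cosine) term with a 2-norm term in one metric-learning objective, consider the triplet-style sketch below. The margins, the mixing weight `alpha` and the triplet formulation itself are assumptions chosen for exposition; the paper's AMDF loss is not reproduced here.

```python
import torch
import torch.nn.functional as F

def angle_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Cosine distance in [0, 2]: 1 - cos(theta) between feature vectors (B, D)."""
    return 1.0 - F.cosine_similarity(a, b, dim=1)

def combined_metric_loss(anchor, positive, negative,
                         angle_margin: float = 0.3,
                         l2_margin: float = 1.0,
                         alpha: float = 0.5) -> torch.Tensor:
    """Triplet-style loss mixing an angular term and a 2-norm term, pulling
    same-identity features together and pushing different identities apart in
    both angle and magnitude."""
    ang = F.relu(angle_distance(anchor, positive)
                 - angle_distance(anchor, negative) + angle_margin)
    l2 = F.relu((anchor - positive).norm(dim=1)
                - (anchor - negative).norm(dim=1) + l2_margin)
    return (alpha * ang + (1.0 - alpha) * l2).mean()
```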

Citations: 0
Tran-GCN: A Transformer-Enhanced Graph Convolutional Network for Person Re-Identification in Monitoring Videos
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-04-29 | DOI: 10.1049/cvi2.70025
Xiaobin Hong, Tarmizi Adam, Masitah Ghazali

Person re-identification (Re-ID) has gained popularity in computer vision, enabling cross-camera pedestrian recognition. Although the development of deep learning has provided a robust technical foundation for person Re-ID research, most existing person Re-ID methods overlook the potential relationships among local person features, failing to adequately address the impact of pedestrian pose variations and local body parts occlusion. Therefore, we propose a transformer-enhanced graph convolutional network (Tran-GCN) model to improve person re-identification performance in monitoring videos. The model comprises four key components: (1) a pose estimation learning branch is utilised to estimate pedestrian pose information and inherent skeletal structure data, extracting pedestrian key point information; (2) a transformer learning branch learns the global dependencies between fine-grained and semantically meaningful local person features; (3) a convolution learning branch uses the basic ResNet architecture to extract the person's fine-grained local features; and (4) a Graph convolutional module (GCM) integrates local feature information, global feature information and body information for more effective person identification after fusion. Quantitative and qualitative analysis experiments conducted on three different datasets (Market-1501, DukeMTMC-ReID and MSMT17) demonstrate that the Tran-GCN model can more accurately capture discriminative person features in monitoring videos, significantly improving identification accuracy.
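One simple way to picture the fusion of local CNN features, global transformer features and pose features in a graph convolutional module is to treat the three streams as nodes of a tiny fully connected graph. The sketch below is only that picture; the feature dimensionality, the fixed adjacency and the mean pooling are illustrative assumptions, not the Tran-GCN GCM.

```python
import torch
import torch.nn as nn

class GraphConvFusion(nn.Module):
    """Fuse local (CNN), global (transformer) and pose features by treating them
    as three nodes of a small fully connected graph and applying one GCN layer."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.weight = nn.Linear(dim, dim)
        # Fixed, row-normalised adjacency over the 3 feature nodes (incl. self-loops).
        self.register_buffer("adj", torch.ones(3, 3) / 3.0)

    def forward(self, local_f, global_f, pose_f):
        # Each input: (B, dim). Stack into a node matrix (B, 3, dim).
        nodes = torch.stack([local_f, global_f, pose_f], dim=1)
        mixed = torch.einsum("ij,bjd->bid", self.adj, nodes)   # neighbourhood aggregation
        fused = torch.relu(self.weight(mixed))                 # shared node transform
        return fused.mean(dim=1)                               # (B, dim) identity embedding
```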

Citations: 0
CNN-Based Flank Predictor for Quadruped Animal Species
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-04-29 | DOI: 10.1049/cvi2.70024
Vanessa Suessle, Marco Heurich, Colleen T. Downs, Andreas Weinmann, Elke Hergenroether

The bilateral asymmetry of flanks, where the sides of an animal with unique visual markings are independently patterned, complicates tasks such as individual identification. Automatically generating additional information on the visible side of the animal would improve the accuracy of individual identification. In this study, we used transfer learning on popular convolutional neural network (CNN) image classification architectures to train a flank predictor that predicted the visible flank of quadruped mammalian species in images. We automatically derived the data labels from existing datasets initially labelled for animal pose estimation. The developed models were evaluated across various scenarios involving unseen quadruped species in familiar and unfamiliar habitats. As a real-world scenario, we used a dataset of manually labelled Eurasian lynx (Lynx lynx) from camera traps in the Bavarian Forest National Park, Germany, to evaluate the model. The model that performed best on data obtained in the field was based on a MobileNetV2 architecture. It achieved an accuracy of 91.7% for the unseen/untrained lynx species in a complex unseen/untrained habitat with challenging light conditions. The developed flank predictor was designed to be embedded as a preprocessing step in the automated analysis of camera trap datasets to enhance tasks such as individual identification.
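The transfer-learning recipe described here (an ImageNet-pretrained MobileNetV2 with a new classification head for the visible flank) is straightforward to sketch with torchvision, assuming torchvision 0.13 or newer for the weights API. The backbone-freezing policy and the two-class head below are assumptions for illustration, not the authors' training setup.

```python
import torch.nn as nn
from torchvision import models

def build_flank_predictor(num_classes: int = 2, freeze_backbone: bool = True) -> nn.Module:
    """Transfer-learning sketch: reuse ImageNet MobileNetV2 features and replace
    the classifier with a left/right flank head."""
    model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.DEFAULT)
    if freeze_backbone:
        for p in model.features.parameters():
            p.requires_grad = False          # keep pretrained features fixed
    model.classifier[1] = nn.Linear(model.last_channel, num_classes)
    return model
```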

Citations: 0
The Generated-bbox Guided Interactive Image Segmentation With Vision Transformers
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-04-24 | DOI: 10.1049/cvi2.70019
Shiyin Zhang, Yafei Dong, Shuang Qiu

Existing click-based interactive image segmentation methods typically initiate object extraction with the first click and iteratively refine the coarse segmentation through subsequent interactions. Unlike box-based methods, click-based approaches mitigate ambiguity when multiple targets are present within a single bounding box, but suffer from a lack of precise location and outline information. Inspired by instance segmentation, the authors propose a Generated-bbox Guided method that provides location and outline information using an automatically generated bounding box, rather than a manually labelled one, minimising the need for extensive user interaction. Building on the success of vision transformers, the authors adopt them as the network architecture to enhance the model's performance. A click-based interactive image segmentation network named the Generated-bbox Guided Coarse-to-Fine Network (GCFN) is proposed. GCFN is a two-stage cascade network comprising two sub-networks: Coarsenet and Finenet. A transformer-based Box Detector is introduced to generate an initial bounding box from an inside click, providing location and outline information. Additionally, two feature enhancement modules guided by foreground and background information are designed: the Foreground-Background Feature Enhancement Module (FFEM) and the Pixel Enhancement Module (PEM). The authors evaluate the GCFN method on five popular benchmark datasets and demonstrate its generalisation capability on three medical image datasets.
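The two-stage, click-to-box-to-mask flow of GCFN can be summarised as a skeleton in which the box detector, coarse network and fine network are supplied as sub-modules. The sketch below only fixes the data flow; the tensor shapes in the comments and the module interfaces are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class CoarseToFineSegmenter(nn.Module):
    """Skeleton of the click -> generated box -> coarse mask -> refined mask flow.
    box_detector, coarse_net and fine_net stand in for the paper's sub-networks."""
    def __init__(self, box_detector: nn.Module, coarse_net: nn.Module, fine_net: nn.Module):
        super().__init__()
        self.box_detector = box_detector
        self.coarse_net = coarse_net
        self.fine_net = fine_net

    def forward(self, image: torch.Tensor, click_map: torch.Tensor) -> torch.Tensor:
        # 1. Generate a bounding box from the single inside click.
        box = self.box_detector(image, click_map)        # assumed (B, 4) normalised xyxy
        # 2. Coarse segmentation conditioned on image, click and generated box.
        coarse = self.coarse_net(image, click_map, box)  # assumed (B, 1, H, W) logits
        # 3. Fine network refines the coarse prediction.
        return self.fine_net(image, coarse)
```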

Citations: 0
Structure-Based Uncertainty Estimation for Source-Free Active Domain Adaptation
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-04-16 | DOI: 10.1049/cvi2.70020
Jihong Ouyang, Zhengjie Zhang, Qingyi Meng, Jinjin Chi

Active domain adaptation (active DA) provides an effective solution by selectively labelling a limited number of target samples to significantly enhance adaptation performance. However, existing active DA methods often struggle in real-world scenarios where, due to data privacy concerns, only a pre-trained source model is available, rather than the source samples. To address this issue, we propose a novel method called the structure-based uncertainty estimation model (SUEM) for source-free active domain adaptation (SFADA). To be specific, we introduce an innovative active sample selection strategy that combines both uncertainty and diversity sampling to identify the most informative samples. We assess the uncertainty in target samples using structure-wise probabilities and implement a diversity selection method to minimise redundancy. For the selected samples, we not only apply standard-supervised loss but also conduct interpolation consistency training to further explore the structural information of the target domain. Extensive experiments across four widely used datasets demonstrate that our method is comparable to or outperforms current UDA and active DA methods.
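A minimal sketch of combining uncertainty and diversity in active sample selection is given below. It substitutes plain softmax entropy for the paper's structure-wise probabilities and a greedy k-centre step for its diversity criterion; the pool size, budget handling and variable names are assumptions.

```python
import torch

@torch.no_grad()
def select_active_samples(probs: torch.Tensor, feats: torch.Tensor,
                          budget: int, uncertain_pool: int = 200) -> list:
    """Pick `budget` target samples: shortlist the most uncertain ones by
    predictive entropy, then greedily pick a diverse subset (k-centre style).

    probs: (N, C) class probabilities; feats: (N, D) feature vectors.
    """
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    pool = entropy.topk(min(uncertain_pool, len(entropy))).indices
    chosen = [pool[0].item()]
    pool_feats = feats[pool]
    dists = (pool_feats - pool_feats[0]).norm(dim=1)       # distance to nearest chosen
    while len(chosen) < min(budget, len(pool)):
        nxt = dists.argmax().item()                        # farthest-from-chosen sample
        chosen.append(pool[nxt].item())
        dists = torch.minimum(dists, (pool_feats - pool_feats[nxt]).norm(dim=1))
    return chosen
```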

Citations: 0
Synchronised and Fine-Grained Head for Skeleton-Based Ambiguous Action Recognition
IF 1.3 | CAS Tier 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-04-15 | DOI: 10.1049/cvi2.70016
Hao Huang, Yujie Lin, Siyu Chen, Haiyang Liu

Skeleton-based action recognition using Graph Convolutional Networks (GCNs) has achieved remarkable performance, but recognising ambiguous actions, such as ‘waving’ and ‘saluting’, remains a significant challenge. Existing methods typically rely on a serial combination of GCNs and Temporal Convolutional Networks (TCNs), where spatial and temporal features are extracted independently, leading to unbalanced spatial-temporal information, which hinders accurate action recognition. Moreover, existing methods for ambiguous actions often overemphasise local details, resulting in the loss of crucial global context, which further complicates the task of differentiating ambiguous actions. To address these challenges, the authors propose a lightweight plug-and-play module called Synchronised and Fine-grained Head (SF-Head), inserted between GCN and TCN layers. SF-Head first conducts Synchronised Spatial-Temporal Extraction (SSTE) with a Feature Redundancy Loss (F-RL), ensuring a balanced interaction between the two types of features. It then performs Adaptive Cross-dimensional Feature Aggregation (AC-FA) with a Feature Consistency Loss (F-CL), which aligns the aggregated features with their original spatial-temporal features. This aggregation step effectively combines both global context and local details, enhancing the model's ability to classify ambiguous actions. Experimental results on the NTU RGB + D 60, NTU RGB + D 120, NW-UCLA and PKU-MMD I datasets demonstrate significant improvements in distinguishing ambiguous actions. Our code will be made available at https://github.com/HaoHuang2003/SFHead.
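A toy stand-in for a plug-and-play head that extracts two synchronised feature views and regularises them with a redundancy loss and a consistency loss is sketched below. The 1x1 branches, the cosine-based redundancy term and the MSE consistency term are assumptions chosen to keep the example short; they do not reproduce SF-Head's SSTE/AC-FA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFHeadSketch(nn.Module):
    """Toy plug-and-play head between GCN and TCN layers: two parallel 1x1
    branches extract 'spatial' and 'temporal' views of the same feature map,
    a redundancy loss keeps them decorrelated, and a consistency loss ties
    their aggregation back to the original feature."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 1)
        self.temporal = nn.Conv2d(channels, channels, 1)
        self.aggregate = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, x: torch.Tensor):
        # x: (B, C, T, V) skeleton feature map (time x joints).
        s, t = self.spatial(x), self.temporal(x)
        redundancy_loss = F.cosine_similarity(s.flatten(1), t.flatten(1), dim=1).abs().mean()
        fused = self.aggregate(torch.cat([s, t], dim=1))
        consistency_loss = F.mse_loss(fused, x)
        return fused, redundancy_loss, consistency_loss
```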

Citations: 0