
Latest Publications in IET Computer Vision

FastFaceCLIP: A lightweight text-driven high-quality face image manipulation
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-02 | DOI: 10.1049/cvi2.12295
Jiaqi Ren, Junping Qin, Qianli Ma, Yin Cao

Although many new methods have emerged for text-driven image manipulation, the large computational power required for model training makes their training slow. These methods also consume a considerable amount of video random access memory (VRAM) during training; when generating high-resolution images, VRAM is often insufficient, so high-resolution outputs cannot be produced. Meanwhile, recent advances in Vision Transformers (ViTs) have demonstrated strong image classification and recognition capabilities. Unlike traditional Convolutional Neural Network-based methods, ViTs use a Transformer-based architecture that leverages attention mechanisms to capture comprehensive global information and, through inherent long-range dependencies, an enhanced global understanding of images, thus extracting more robust features and achieving comparable results with a reduced computational load. The adaptability of ViTs to text-driven image manipulation was investigated. Specifically, existing image generation methods were refined, and the FastFaceCLIP method was proposed by combining the image-text semantic alignment of the pre-trained CLIP model with the high-resolution image generation of the proposed FastFace. Additionally, a Multi-Axis Nested Transformer module was incorporated for advanced feature extraction from the latent space, generating higher-resolution images that are further enhanced with the Real-ESRGAN algorithm. Finally, extensive face-manipulation experiments on the CelebA-HQ dataset compare the proposed method with other related schemes, demonstrating that FastFaceCLIP generates semantically accurate, visually realistic, and sharp images using fewer parameters and less time.
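
The central mechanism the abstract relies on, scoring a generated face against a text prompt with CLIP's joint image-text embedding and back-propagating that score into a generator's latent code, can be illustrated with a minimal sketch. The `generator` and `clip_model` objects below (with `encode_image`/`encode_text` methods) are hypothetical stand-ins, and the plain latent-optimisation loop is a generic CLIP-guidance scheme, not the authors' FastFace or Multi-Axis Nested Transformer pipeline.

```python
import torch
import torch.nn.functional as F

def clip_direction_loss(clip_model, image, text_tokens):
    """Cosine distance between CLIP embeddings of a generated image and a text prompt."""
    img_emb = F.normalize(clip_model.encode_image(image), dim=-1)
    txt_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
    return 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()

def edit_latent(generator, clip_model, w, text_tokens, steps=200, lr=0.01, lam=0.8):
    """Optimise a copy of latent code w so the decoded face matches the prompt,
    while an L2 term keeps the edit close to the original identity."""
    w_edit = w.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w_edit], lr=lr)
    for _ in range(steps):
        img = generator(w_edit)  # assumed to return (N, 3, H, W) in CLIP's expected range
        loss = clip_direction_loss(clip_model, img, text_tokens) + lam * F.mse_loss(w_edit, w)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w_edit.detach()
```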

Citations: 0
PSANet: Automatic colourisation using position-spatial attention for natural images
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-06-16 | DOI: 10.1049/cvi2.12291
Peng-Jie Zhu, Yuan-Yuan Pu, Qiuxia Yang, Siqi Li, Zheng-Peng Zhao, Hao Wu, Dan Xu

Due to the richness of natural image semantics, natural image colourisation is a challenging problem. Existing methods often suffer from semantic confusion caused by insufficient semantic understanding, resulting in unreasonable colour assignments, especially at object edges; this phenomenon is referred to as colour bleeding. The authors have found that the self-attention mechanism benefits the model's understanding and recognition of object semantics. However, it introduces another colourisation problem, namely dull colour. With this in mind, a Position-Spatial Attention Network (PSANet) is proposed to address both colour bleeding and dull colour. Firstly, a novel attention module called the position-spatial attention module (PSAM) is introduced; through it, the model enhances its semantic understanding of images while solving the dull-colour problem caused by self-attention. Then, to further prevent colour bleeding at object boundaries, a gradient-aware loss is proposed. Lastly, colour bleeding is further reduced by the combined effect of the gradient-aware loss and an edge-aware loss. Experimental results show that this method largely reduces colour bleeding while maintaining good perceptual quality.
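
As a rough illustration of the gradient-aware idea, the sketch below penalises mismatches between the finite-difference gradients of predicted and ground-truth chrominance channels, which concentrates the penalty at object boundaries where colour bleeding appears. This is a plausible minimal form, not necessarily PSANet's exact loss; the tensor shapes and the two-channel (Lab-style) output are assumptions.

```python
import torch
import torch.nn.functional as F

def image_gradients(x):
    """Finite-difference gradients along height and width for a (N, C, H, W) tensor."""
    dy = x[:, :, 1:, :] - x[:, :, :-1, :]
    dx = x[:, :, :, 1:] - x[:, :, :, :-1]
    return dy, dx

def gradient_aware_loss(pred_ab, gt_ab):
    """Penalise mismatched colour gradients between prediction and ground truth."""
    pdy, pdx = image_gradients(pred_ab)
    gdy, gdx = image_gradients(gt_ab)
    return F.l1_loss(pdy, gdy) + F.l1_loss(pdx, gdx)

# toy usage on random stand-in tensors (two chrominance channels)
pred = torch.rand(4, 2, 64, 64, requires_grad=True)
gt = torch.rand(4, 2, 64, 64)
loss = gradient_aware_loss(pred, gt)
loss.backward()
```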

Citations: 0
Knowledge distillation of face recognition via attention cosine similarity review
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-05-31 | DOI: 10.1049/cvi2.12288
Zhuo Wang, SuWen Zhao, WanYi Guo

Deep learning-based face recognition models have demonstrated remarkable performance in benchmark tests, and knowledge distillation has frequently been employed to obtain high-precision real-time face recognition models designed specifically for mobile and embedded devices. However, recent knowledge distillation methods for face recognition mainly focus on feature or logit distillation and neglect the attention mechanism, which plays an important role in neural networks. An attention cosine similarity knowledge distillation method with an innovative cross-stage connection review path, uniting the attention mechanism with the review knowledge distillation method, is proposed. This method transfers the attention maps obtained from the teacher network to the student through a cross-stage connection path. The efficacy and excellence of the proposed algorithm are demonstrated on popular benchmark tests.
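
A minimal sketch of the underlying loss follows: teacher and student feature maps are collapsed into spatial attention maps, and the student is trained to maximise their cosine similarity. The feature lists, the squared-activation attention definition, and the bilinear resizing used to mimic a cross-stage (review-style) comparison are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_attention(feat, p=2):
    """Collapse a (N, C, H, W) feature map into a unit-norm (N, H*W) attention map."""
    att = feat.abs().pow(p).mean(dim=1)           # (N, H, W)
    return F.normalize(att.flatten(1), dim=1)

def attention_cosine_loss(student_feats, teacher_feats):
    """1 - cosine similarity between student and teacher attention maps, averaged over stages.
    Teacher maps are resized when they come from a stage with a different resolution."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        if fs.shape[-2:] != ft.shape[-2:]:
            ft = F.interpolate(ft, size=fs.shape[-2:], mode="bilinear", align_corners=False)
        a_s, a_t = spatial_attention(fs), spatial_attention(ft.detach())
        loss = loss + (1.0 - F.cosine_similarity(a_s, a_t, dim=1).mean())
    return loss / len(student_feats)

# toy usage with random multi-stage features of different channel counts and sizes
student = [torch.randn(2, 64, 32, 32), torch.randn(2, 128, 16, 16)]
teacher = [torch.randn(2, 256, 32, 32), torch.randn(2, 512, 8, 8)]
print(attention_cosine_loss(student, teacher))
```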

Citations: 0
SkatingVerse: A large-scale benchmark for comprehensive evaluation on human action understanding
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-05-30 | DOI: 10.1049/cvi2.12287
Ziliang Gan, Lei Jin, Yi Cheng, Yu Cheng, Yinglei Teng, Zun Li, Yawen Li, Wenhan Yang, Zheng Zhu, Junliang Xing, Jian Zhao

Human action understanding (HAU) is a broad topic that involves specific tasks such as action localisation, recognition, and assessment. However, most popular HAU datasets are bound to one task based on particular actions. Combining different but related HAU tasks to establish a unified action understanding system is challenging due to the disparate actions across datasets. A large-scale and comprehensive benchmark, SkatingVerse, is constructed for action recognition, segmentation, proposal, and assessment. SkatingVerse focuses on fine-grained sports actions; figure skating is chosen as the task domain, which eliminates the object, scene, and space biases present in most previous datasets. In addition, skating actions have inherent complexity and similarity, which poses an enormous challenge for current algorithms. A total of 1687 official figure skating competition videos was collected, totalling 184.4 h, more than four times the size of other datasets on similar topics. SkatingVerse makes it possible to formulate a unified task that outputs fine-grained human action classification and assessment results from a raw figure skating competition video. In addition, SkatingVerse can facilitate the study of HAU foundation models due to its large scale and abundant categories. Moreover, an image modality is incorporated into SkatingVerse for the human pose estimation task. Extensive experimental results show that (1) SkatingVerse significantly helps the training and evaluation of HAU methods, (2) the performance of existing HAU methods has much room to improve, and SkatingVerse helps to reduce such gaps, and (3) unifying relevant tasks in HAU through a uniform dataset can facilitate more practical applications. SkatingVerse will be publicly available to facilitate further studies on relevant problems.

Citations: 0
Federated finger vein presentation attack detection for various clients
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-05-30 | DOI: 10.1049/cvi2.12292
Hengyu Mu, Jian Guo, Xingli Liu, Chong Han, Lijuan Sun

Recently, finger vein recognition has become a popular application. Studies have shown that finger vein presentation attacks increasingly threaten these recognition devices, so research on finger vein presentation attack detection (fvPAD) methods has received much attention. However, current fvPAD methods have two limitations. (1) Most terminal devices cannot train fvPAD models independently due to a lack of data. (2) Several research institutes can train fvPAD models; however, these models perform poorly when applied to terminal devices due to inadequate generalisation. Consequently, it is difficult for threatened terminal devices to obtain an effective fvPAD model. To address this problem, a method of federated finger vein presentation attack detection for various clients is proposed, the first study to introduce federated learning (FL) to fvPAD. The proposed method accounts for the differences in data volume and computing power between clients by expanding traditional FL clients into two categories: institutional and terminal clients. For institutional clients, an improved triplet training mode with FL is designed to enhance model generalisation. For terminal clients, the inability to obtain effective fvPAD models is resolved. Finally, extensive experiments on three datasets demonstrate the superiority of the method.
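
The server-side aggregation that federated learning relies on can be sketched as plain FedAvg: institutional clients train locally (for example with a triplet loss) and upload weights, while terminal clients only download the aggregated model. The function below and the commented round are a generic sketch under those assumptions; the paper's actual aggregation rule and client protocol may differ, and `train_institutional_clients`/`terminal_clients` are hypothetical names.

```python
import copy
from typing import Dict, List
import torch

def fedavg(client_states: List[Dict[str, torch.Tensor]],
           client_sizes: List[int]) -> Dict[str, torch.Tensor]:
    """Aggregate client model weights by a data-size-weighted average (plain FedAvg)."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total) for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Hypothetical round: institutional clients train locally (e.g. with a triplet loss)
# and upload their weights; terminal clients only download the aggregated model.
# states, sizes = train_institutional_clients(global_model)   # hypothetical helper
# new_global = fedavg(states, sizes)
# for terminal in terminal_clients:                            # hypothetical client list
#     terminal.load_state_dict(new_global)
```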

Citations: 0
Eigenspectrum regularisation reverse neighbourhood discriminative learning
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-05-14 | DOI: 10.1049/cvi2.12284
Ming Xie, Hengliang Tan, Jiao Du, Shuo Yang, Guofeng Yan, Wangwang Li, Jianwei Feng

Linear discriminant analysis is a classical method for dimensionality reduction and pattern classification. Although it has been extensively developed, it still suffers from common problems such as the Small Sample Size (SSS) problem and the multimodal problem. Neighbourhood linear discriminant analysis (nLDA) was recently proposed to solve the multimodal-class problem caused by the violation of the independent and identically distributed sample assumption. However, because many practical applications are small-scale, nLDA still faces the SSS problem, which leads to instability and poor generalisation caused by the singularity of the within-neighbourhood scatter matrix. The authors exploit eigenspectrum regularisation techniques to circumvent this singularity, yielding Eigenspectrum Regularisation Reverse Neighbourhood Discriminative Learning (ERRNDL). The nLDA algorithm is reformulated as a framework that searches for two projection matrices, and three eigenspectrum regularisation models are introduced into the framework to evaluate performance. Experiments conducted on the University of California, Irvine machine learning repository and six image classification datasets show that the proposed ERRNDL-based methods achieve considerable performance.
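
To make the singularity issue concrete, the sketch below shows one generic way to regularise the eigenspectrum of a rank-deficient scatter matrix: keep the reliable leading eigenvalues and lift the unreliable tail (including the null space) to a floor value so the matrix becomes invertible. The energy threshold and flooring rule are illustrative assumptions and are not the three regularisation models evaluated in the paper.

```python
import numpy as np

def regularised_scatter(S: np.ndarray, energy: float = 0.99) -> np.ndarray:
    """Keep eigenvalues explaining `energy` of the spectrum and floor the rest."""
    evals, evecs = np.linalg.eigh(S)               # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]     # descending order
    evals = np.clip(evals, 0.0, None)              # remove tiny negative numerical noise
    cum = np.cumsum(evals) / evals.sum()
    m = int(np.searchsorted(cum, energy)) + 1      # number of reliable eigenvalues
    floor = evals[m - 1]                           # floor value for the unreliable tail
    reg = np.concatenate([evals[:m], np.full(len(evals) - m, floor)])
    return (evecs * reg) @ evecs.T                 # V diag(reg) V^T

# toy check: a rank-deficient within-neighbourhood scatter becomes full rank
X = np.random.randn(10, 50)                        # 10 samples in 50 dimensions
S_w = X.T @ X                                      # rank <= 10, hence singular
print(np.linalg.matrix_rank(S_w), np.linalg.matrix_rank(regularised_scatter(S_w)))
```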

Citations: 0
CLaSP: Cross-view 6-DoF localisation assisted by synthetic panorama
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-05-13 | DOI: 10.1049/cvi2.12285
Juelin Zhu, Shen Yan, Xiaoya Cheng, Rouwan Wu, Yuxiang Liu, Maojun Zhang

Despite the impressive progress in visual localisation, 6-DoF cross-view localisation remains a challenging task in the computer vision community due to the huge appearance changes between views. To address this issue, the authors propose CLaSP, a coarse-to-fine framework that leverages a synthetic panorama to facilitate cross-view 6-DoF localisation in a large-scale scene. The authors first leverage a segmentation map to correct the prior pose, then synthesise a ground-level panorama to enable coarse pose estimation combined with template matching. Finally, the refinement stage is formulated as feature matching and pose refinement to obtain the final result. CLaSP and several state-of-the-art baselines are evaluated on the Airloc dataset, demonstrating the effectiveness of the proposed framework.
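
The coarse step pairs a synthesised ground-level panorama with template matching; a minimal version of that idea is to slide the query view across the panorama and read the best-matching column off as a heading estimate. The OpenCV-based sketch below uses random images as stand-ins and is only a generic illustration of template matching, not the CLaSP pipeline, which also corrects the prior pose with a segmentation map and refines it by feature matching.

```python
import cv2
import numpy as np

def coarse_yaw_from_panorama(panorama: np.ndarray, query: np.ndarray) -> float:
    """Slide the query view over a synthetic ground-level panorama and convert the
    best-matching column into a coarse heading (yaw) estimate in degrees."""
    pan = cv2.cvtColor(panorama, cv2.COLOR_BGR2GRAY)
    qry = cv2.cvtColor(query, cv2.COLOR_BGR2GRAY)
    # scale the query to the panorama height so it can slide horizontally
    qry = cv2.resize(qry, (qry.shape[1] * pan.shape[0] // qry.shape[0], pan.shape[0]))
    scores = cv2.matchTemplate(pan, qry, cv2.TM_CCOEFF_NORMED)
    _, _, _, (x, _) = cv2.minMaxLoc(scores)
    return 360.0 * (x + qry.shape[1] / 2) / pan.shape[1]

# toy usage with random stand-in images of plausible sizes
panorama = (np.random.rand(256, 1024, 3) * 255).astype(np.uint8)
query = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
print(coarse_yaw_from_panorama(panorama, query))
```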

Citations: 0
Guest Editorial: Advanced image restoration and enhancement in the wild
IF 1.7 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-19 | DOI: 10.1049/cvi2.12283
Longguang Wang, Juncheng Li, Naoto Yokoya, Radu Timofte, Yulan Guo
Image restoration and enhancement has always been a fundamental task in computer vision and is widely used in numerous applications, such as surveillance imaging, remote sensing, and medical imaging. In recent years, remarkable progress has been witnessed with deep learning techniques. Despite the promising performance achieved on synthetic data, compelling research challenges remain to be addressed in the wild. These include: (i) degradation models for low-quality images in the real world are complicated and unknown, (ii) paired low-quality and high-quality data are difficult to acquire in the real world, and a large quantity of real data are provided in an unpaired form, (iii) it is challenging to incorporate cross-modal information provided by advanced imaging techniques (e.g. RGB-D camera) for image restoration, (iv) real-time inference on edge devices is important for image restoration and enhancement methods, and (v) it is difficult to provide the confidence or performance bounds of a learning-based method on different images/regions. This special issue invites original contributions in datasets, innovative architectures, and training methods for image restoration and enhancement to address these and other challenges.

In this Special Issue, we have received 17 papers, of which 8 papers underwent the peer review process, while the rest were desk-rejected. Among these reviewed papers, 5 papers have been accepted and 3 papers have been rejected as they did not meet the criteria of IET Computer Vision. Thus, the overall submissions were of high quality, which marks the success of this Special Issue.

The five eventually accepted papers can be clustered into two categories, namely video reconstruction and image super-resolution. The first category of papers aims at reconstructing high-quality videos. The papers in this category are of Zhang et al., Gu et al., and Xu et al. The second category of papers studies the task of image super-resolution. The papers in this category are of Dou et al. and Yang et al. A brief presentation of each paper in this special issue is as follows.

Zhang et al. propose a point-image fusion network for event-based frame interpolation. Temporal information in event streams plays a critical role in this task as it provides temporal context cues complementary to images. Previous approaches commonly transform the unstructured event data to structured data formats through voxelisation and then employ advanced CNNs to extract temporal information. However, the voxelisation operation inevitably leads to information loss and introduces redundant computation. To address these limitations, the proposed method directly extracts temporal information from the events at the point level without relying on any voxelisation operation. Afterwards, a fusion module is adopted to aggregate complementary cues from both points and images for frame interpolation. Experiments on both synthetic and real-world dataset …
Citations: 0
Temporal channel reconfiguration multi-graph convolution network for skeleton-based action recognition
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-17 | DOI: 10.1049/cvi2.12279
Siyue Lei, Bin Tang, Yanhua Chen, Mingfu Zhao, Yifei Xu, Zourong Long

Skeleton-based action recognition has received much attention and achieved remarkable results in the field of human action recognition. In time-series action prediction at different scales, existing methods mainly rely on attention mechanisms to enhance modelling capabilities in the spatial dimension. However, this approach depends strongly on the local information of a single input feature and fails to facilitate the flow of information between channels. To address these issues, the authors propose a novel Temporal Channel Reconfiguration Multi-Graph Convolution Network (TRMGCN). In the temporal convolution part, a module called Temporal Channel Fusion with Guidance (TCFG) is designed to capture important temporal information within channels at different scales and to avoid ignoring cross-spatio-temporal dependencies among joints. In the graph convolution part, the authors propose Top-Down Attention Multi-graph Independent Convolution (TD-MIG), which uses multi-graph independent convolution to learn topological graph features for time series of different lengths. Top-down attention is introduced for spatial and channel modulation to facilitate information flow in channels that do not establish topological relationships. Experimental results on the large-scale datasets NTU RGB+D 60 and 120, as well as UAV-Human, demonstrate that TRMGCN achieves advanced performance. Furthermore, experiments on the smaller NW-UCLA dataset indicate that the model possesses strong generalisation ability.
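
For readers unfamiliar with the graph-convolution backbone such methods build on, the sketch below implements one generic spatial graph convolution over skeleton joints: a normalised adjacency applied per frame, followed by a channel projection. It is a baseline building block only; the TCFG and TD-MIG modules described above are not reproduced here, and the single hypothetical bone in the toy adjacency stands in for a full skeleton graph.

```python
import torch
import torch.nn as nn

class SkeletonGraphConv(nn.Module):
    """One spatial graph convolution: X' = D^{-1/2} (A + I) D^{-1/2} X W, per frame."""
    def __init__(self, in_channels: int, out_channels: int, adjacency: torch.Tensor):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        d = A.sum(dim=1).pow(-0.5)                        # D^{-1/2}
        self.register_buffer("A_norm", d.unsqueeze(1) * A * d.unsqueeze(0))
        self.proj = nn.Linear(in_channels, out_channels)  # channel projection W

    def forward(self, x):              # x: (N, T, V, C) = batch, frames, joints, channels
        x = torch.einsum("uv,ntvc->ntuc", self.A_norm, x)
        return self.proj(x)

# toy usage: 25 joints with 3D coordinates, 64 output channels
A = torch.zeros(25, 25)
A[0, 1] = A[1, 0] = 1.0                # hypothetical single bone; a real graph lists all bones
layer = SkeletonGraphConv(3, 64, A)
out = layer(torch.randn(8, 32, 25, 3))
print(out.shape)                       # torch.Size([8, 32, 25, 64])
```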

Citations: 0
Instance segmentation by blend U-Net and VOLO network
IF 1.5 | Zone 4, Computer Science | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-04-09 | DOI: 10.1049/cvi2.12275
Hongfei Deng, Bin Wen, Rui Wang, Zuwei Feng

Instance segmentation still struggles to correctly distinguish different instances among overlapping, dense, and numerous target objects. To address this, the authors simplify the instance segmentation problem to an instance classification problem and propose CotuNet, a novel end-to-end trained instance segmentation algorithm. Firstly, the algorithm combines convolutional neural networks (CNNs), Outlooker, and Transformer in a new hybrid encoder (COT) for feature extraction: low-level image features extracted by the CNN are passed through the Outlooker to obtain more refined local data representations, and global contextual information is then generated by aggregating these local representations with the Transformer. Finally, a combination of cascaded upsampling and skip-connection modules is used as the decoder (C-UP), blending high-resolution information at multiple scales to generate accurate masks. Validation on the CVPPP 2017 dataset and comparison with previous state-of-the-art methods show that CotuNet achieves superior competitiveness and segmentation performance.
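
The decoder idea, cascaded upsampling fused with encoder skip connections in the U-Net style, can be sketched with a single stage like the one below. The channel sizes and the bilinear-upsample-then-convolve design are illustrative assumptions rather than the exact C-UP configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One decoder stage: upsample, concatenate the matching encoder feature
    (skip connection), then fuse with two 3x3 convolutions."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([x, skip], dim=1))

# toy usage: fuse a coarse 1/8-resolution feature with a 1/4-resolution skip feature
up = UpBlock(in_ch=256, skip_ch=128, out_ch=128)
coarse = torch.randn(2, 256, 32, 32)
skip = torch.randn(2, 128, 64, 64)
print(up(coarse, skip).shape)          # torch.Size([2, 128, 64, 64])
```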

Citations: 0