
Computer Vision and Image Understanding: Latest Articles

A semantic segmentation method integrated convolutional nonlinear spiking neural model with Transformer
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-09 | DOI: 10.1016/j.cviu.2024.104196
Siyan Sun, Wenqian Yang, Hong Peng, Jun Wang, Zhicai Liu
Semantic segmentation is a critical task in computer vision, with significant applications in areas like autonomous driving and medical imaging. Transformer-based methods have gained considerable attention recently because of their strength in capturing global information. However, these methods often sacrifice detailed information due to the lack of mechanisms for local interactions. Similarly, convolutional neural network (CNN) methods struggle to capture global context due to the inherent limitations of convolutional kernels. To overcome these challenges, this paper introduces a novel Transformer-based semantic segmentation method called NSNPFormer, which leverages the nonlinear spiking neural P (NSNP) system—a computational model inspired by the spiking mechanisms of biological neurons. The NSNPFormer employs an encoding–decoding structure with two convolutional NSNP components and a residual connection channel. The convolutional NSNP components facilitate nonlinear local feature extraction and block-level feature fusion. Meanwhile, the residual connection channel helps prevent the loss of feature information during the decoding process. Evaluations on the ADE20K and Pascal Context datasets show that NSNPFormer achieves mIoU scores of 53.7 and 58.06, respectively, highlighting its effectiveness in semantic segmentation tasks.
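The abstract describes an encoding–decoding segmentation network in which a residual connection channel carries encoder features to the decoder. Below is a minimal PyTorch sketch of that general pattern, not the paper's network: `ConvBlock` merely stands in for the convolutional NSNP components, and all layer widths and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Plain conv block standing in for a convolutional NSNP component (illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class EncoderDecoderSeg(nn.Module):
    """Encoder-decoder with a residual connection channel from encoder to decoder."""
    def __init__(self, in_ch=3, num_classes=150):
        super().__init__()
        self.enc1 = ConvBlock(in_ch, 64)
        self.enc2 = ConvBlock(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.dec = ConvBlock(128 + 64, 64)   # fuses upsampled features with the residual channel
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        f1 = self.enc1(x)                    # high-resolution encoder features
        f2 = self.enc2(self.pool(f1))        # downsampled encoder features
        up = nn.functional.interpolate(f2, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([up, f1], dim=1)   # residual connection channel: reuse encoder features
        return self.head(self.dec(fused))    # per-pixel class logits

logits = EncoderDecoderSeg()(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 150, 64, 64])
```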
Citations: 0
MT-DSNet: Mix-mask teacher–student strategies and dual dynamic selection plug-in module for fine-grained image recognition
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-08 | DOI: 10.1016/j.cviu.2024.104201
Hongchun Lu, Min Han
The fine-grained image recognition (FGIR) task aims to classify and distinguish subtle differences between subcategories with visually similar appearances, such as bird species and the makes or models of vehicles. However, subtle interclass differences and significant intraclass variances lead to poor model recognition performance. To address these challenges, we developed a mixed-mask teacher–student cooperative training strategy. A mixed masked image is generated and embedded into a knowledge distillation network by replacing one image’s visible marker with another’s masked marker. Collaborative reinforcement between teachers and students is used to improve the recognition performance of the network. We chose the classic transformer architecture as a baseline to better explore the contextual relationships between features. Additionally, we suggest a dual dynamic selection plug-in that chooses features with discriminative capabilities in the spatial and channel dimensions and filters out irrelevant interference information to efficiently handle background and noise features in fine-grained images. The proposed feature suppression module is used to enhance the differences between different features, thereby motivating the network to mine more discriminative features. We validated our method on two datasets: CUB-200-2011 and Stanford Cars. The experimental results show that the proposed MT-DSNet can significantly improve the feature representation for FGIR tasks. Moreover, by applying it to different fine-grained networks, FGIR accuracy can be improved without changing the original network structure. We hope that this work provides a promising approach for improving the feature representation of networks in the future.
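The mixed-mask construction (replacing one image's visible patches with masked patches taken from another image) can be sketched with a random patch mask, as below. The patch size and masking ratio are illustrative assumptions, not the paper's settings.

```python
import torch

def mix_masked_images(img_a, img_b, patch=16, mask_ratio=0.5):
    """Build a mixed image: keep img_a where the patch mask is 1, take img_b elsewhere.

    img_a, img_b: (B, C, H, W) tensors with H, W divisible by `patch`.
    Returns the mixed image and the patch-level mask used to build it.
    """
    B, C, H, W = img_a.shape
    gh, gw = H // patch, W // patch
    # Random binary patch mask: 1 -> visible patch from img_a, 0 -> masked patch taken from img_b.
    patch_mask = (torch.rand(B, 1, gh, gw) > mask_ratio).float()
    pixel_mask = patch_mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    mixed = img_a * pixel_mask + img_b * (1.0 - pixel_mask)
    return mixed, patch_mask

a, b = torch.rand(2, 3, 224, 224), torch.rand(2, 3, 224, 224)
mixed, mask = mix_masked_images(a, b)
print(mixed.shape, mask.shape)  # torch.Size([2, 3, 224, 224]) torch.Size([2, 1, 14, 14])
```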
Citations: 0
Hyperspectral image classification with token fusion on GPU
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-05 | DOI: 10.1016/j.cviu.2024.104198
He Huang, Sha Tao
Hyperspectral images capture material nuances with spectral data, vital for remote sensing. Transformers have become a mainstream approach for tackling the challenges posed by high-dimensional hyperspectral data with complex structures. However, a major challenge they face when processing hyperspectral images is the presence of a large number of redundant tokens, which significantly increases the model’s computational burden and slows inference. Therefore, we propose a token fusion algorithm tailored to the operational characteristics of hyperspectral images and pure transformer networks, aimed at enhancing the final accuracy and throughput of the model. The token fusion algorithm introduces a token merging step between the attention mechanism and the multi-layer perceptron module in each Transformer layer. Experiments on four hyperspectral image datasets demonstrate that our token fusion algorithm can significantly improve inference speed without any training, while only causing a slight decrease in the pure transformer network’s classification accuracy.
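A much-simplified token-merging step in the spirit of the description above, not the paper's exact algorithm: after attention, the `r` most similar neighbouring token pairs are averaged into single tokens before the MLP, shortening the sequence and reducing compute.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(x, r):
    """Merge the r most similar (even, odd) neighbouring token pairs by averaging.

    x: (B, N, D) token sequence with N even. Returns (B, N - r, D).
    A simplified stand-in for token fusion between attention and the MLP.
    """
    B, N, D = x.shape
    a, b = x[:, 0::2, :], x[:, 1::2, :]              # candidate pairs
    sim = F.cosine_similarity(a, b, dim=-1)          # (B, N/2) similarity per pair
    merge_idx = sim.topk(r, dim=1).indices           # most redundant pairs -> merge
    keep = torch.ones(B, N // 2, dtype=torch.bool)
    keep[torch.arange(B).unsqueeze(1), merge_idx] = False   # remaining pairs stay separate

    merged_out = []
    for i in range(B):
        merged = 0.5 * (a[i, merge_idx[i]] + b[i, merge_idx[i]])          # r averaged tokens
        kept = torch.stack([a[i, keep[i]], b[i, keep[i]]], dim=1).flatten(0, 1)
        merged_out.append(torch.cat([kept, merged], dim=0))
    return torch.stack(merged_out)

tokens = torch.randn(2, 16, 64)
print(merge_similar_tokens(tokens, r=4).shape)  # torch.Size([2, 12, 64])
```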
Citations: 0
Exploring event-based human pose estimation with 3D event representations
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-05 | DOI: 10.1016/j.cviu.2024.104189
Xiaoting Yin, Hao Shi, Jiaan Chen, Ze Wang, Yaozu Ye, Kailun Yang, Kaiwei Wang
Human pose estimation is a fundamental and appealing task in computer vision. Although traditional cameras are commonly applied, their reliability decreases in scenarios under high dynamic range or heavy motion blur, where event cameras offer a robust solution. Predominant event-based methods accumulate events into frames, ignoring the asynchronous and high temporal resolution that is crucial for distinguishing distinct actions. To address this issue and to unlock the 3D potential of event information, we introduce two 3D event representations: the Rasterized Event Point Cloud (RasEPC) and the Decoupled Event Voxel (DEV). The RasEPC aggregates events within concise temporal slices at identical positions, preserving their 3D attributes along with statistical information, thereby significantly reducing memory and computational demands. Meanwhile, the DEV representation discretizes events into voxels and projects them across three orthogonal planes, utilizing decoupled event attention to retrieve 3D cues from the 2D planes. Furthermore, we develop and release EV-3DPW, a synthetic event-based dataset crafted to facilitate training and quantitative analysis in outdoor scenes. Our methods are tested on the DHP19 public dataset, MMHPSD dataset, and our EV-3DPW dataset, with further qualitative validation via a derived driving scene dataset EV-JAAD and an outdoor collection vehicle. Our code and dataset have been made publicly available at https://github.com/MasterHow/EventPointPose.
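The rasterized event point cloud idea (aggregating events that share a pixel within each temporal slice while keeping simple statistics) can be sketched in NumPy as follows; the particular statistics and slice count are illustrative assumptions.

```python
import numpy as np

def rasterize_events(x, y, t, p, H, W, num_slices=4):
    """Aggregate events per (slice, y, x) cell: event count, mean timestamp, polarity sum.

    x, y: pixel coordinates; t: timestamps; p: polarities in {-1, +1}.
    Returns an array of shape (num_slices, 3, H, W).
    """
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    s = np.minimum((t_norm * num_slices).astype(int), num_slices - 1)   # temporal slice index

    out = np.zeros((num_slices, 3, H, W), dtype=np.float32)
    np.add.at(out, (s, 0, y, x), 1.0)        # channel 0: event count
    np.add.at(out, (s, 1, y, x), t_norm)     # channel 1: summed timestamps (mean below)
    np.add.at(out, (s, 2, y, x), p)          # channel 2: polarity sum
    count = out[:, 0:1]
    out[:, 1:2] = np.where(count > 0, out[:, 1:2] / np.maximum(count, 1), 0.0)  # mean timestamp
    return out

n = 10000
x = np.random.randint(0, 346, n); y = np.random.randint(0, 260, n)
t = np.sort(np.random.rand(n)); p = np.random.choice([-1, 1], n)
print(rasterize_events(x, y, t, p, H=260, W=346).shape)  # (4, 3, 260, 346)
```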
Citations: 0
AWADA: Foreground-focused adversarial learning for cross-domain object detection
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-05 | DOI: 10.1016/j.cviu.2024.104153
Maximilian Menke, Thomas Wenzel, Andreas Schwung
Object detection networks have achieved impressive results, but it can be challenging to replicate this success in practical applications due to a lack of relevant data specific to the task. Typically, additional data sources are used to support the training process. However, the domain gaps between these data sources present a challenge. Adversarial image-to-image style transfer is often used to bridge this gap, but it is not directly connected to the object detection task and can be unstable. We propose AWADA, a framework that combines attention-weighted adversarial domain adaptation connecting style transfer and object detection. By using object detector proposals to create attention maps for foreground objects, we focus the style transfer on these regions and stabilize the training process. Our results demonstrate that AWADA can reach state-of-the-art unsupervised domain adaptation performance in three commonly used benchmarks.
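Building a foreground attention map from detector proposals and using it to weight a per-pixel style-transfer loss can be sketched as below; the box format, weights, and loss are illustrative assumptions rather than AWADA's exact formulation.

```python
import torch

def attention_map_from_boxes(boxes, H, W):
    """Binary foreground attention map from proposal boxes given as (x1, y1, x2, y2) pixels."""
    attn = torch.zeros(H, W)
    for x1, y1, x2, y2 in boxes:
        attn[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return attn

def weighted_style_loss(per_pixel_loss, attn, fg_weight=2.0, bg_weight=1.0):
    """Up-weight the style-transfer reconstruction loss on foreground regions."""
    weights = bg_weight + (fg_weight - bg_weight) * attn
    return (per_pixel_loss * weights).mean()

boxes = [(40, 30, 120, 90), (200, 50, 260, 140)]          # hypothetical detector proposals
attn = attention_map_from_boxes(boxes, H=256, W=512)
loss = weighted_style_loss(torch.rand(256, 512), attn)
print(attn.sum().item(), loss.item())
```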
Citations: 0
Generative adversarial network for semi-supervised image captioning
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-04 | DOI: 10.1016/j.cviu.2024.104199
Xu Liang, Chen Li, Lihua Tian
Traditional supervised image captioning methods usually rely on a large number of images and paired captions for training. However, the creation of such datasets requires considerable time and human resources. Therefore, we propose a new semi-supervised image captioning algorithm to solve this problem. The proposed method uses a generative adversarial network to generate images that match captions, and uses these generated images and captions as new training data. This avoids the error-accumulation problem of generating pseudo captions with an autoregressive method, and the network can directly perform backpropagation. At the same time, in order to ensure the correlation between the generated images and captions, we introduce the CLIP model as a constraint. The CLIP model has been pre-trained on a large amount of image–text data, so it shows excellent performance in semantic alignment of images and text. To verify the effectiveness of our method, we validate on the MSCOCO offline “Karpathy” test split. Experimental results show that our method can significantly improve the performance of the model when using 1% paired data, with the CIDEr score increasing from 69.5% to 77.7%. This shows that our method can effectively utilize unlabeled data for image captioning tasks.
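The CLIP-based correlation constraint can be sketched as a cosine-alignment loss between paired image and caption embeddings; the embeddings below are placeholder tensors, standing in for the outputs of a frozen pretrained CLIP image/text encoder.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(image_emb, text_emb):
    """Penalize low cosine similarity between paired image and caption embeddings.

    image_emb, text_emb: (B, D) embeddings from a frozen pretrained image/text encoder
    (e.g. CLIP's image and text towers); here they are just placeholder tensors.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    cosine = (image_emb * text_emb).sum(dim=-1)     # per-pair similarity in [-1, 1]
    return (1.0 - cosine).mean()                    # 0 when generated images match their captions

img_e, txt_e = torch.randn(8, 512), torch.randn(8, 512)   # placeholder embeddings
print(clip_alignment_loss(img_e, txt_e).item())
```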
Citations: 0
BundleMoCap++: Efficient, robust and smooth motion capture from sparse multiview videos
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-04 | DOI: 10.1016/j.cviu.2024.104190
Georgios Albanis, Nikolaos Zioulis, Kostas Kolomvatsos
Producing smooth and accurate motions from sparse videos without requiring specialized equipment and markers is a long-standing problem in the research community. Most approaches typically involve complex processes such as temporal constraints, multiple stages combining data-driven regression and optimization techniques, and bundle solving over temporal windows. These increase the computational burden and introduce the challenge of hyperparameter tuning for the different objective terms. In contrast, BundleMoCap++ offers a simple yet effective approach to this problem. It solves the motion in a single stage, eliminating the need for temporal smoothness objectives while still delivering smooth motions without compromising accuracy. BundleMoCap++ outperforms the state-of-the-art without increasing complexity. Our approach is based on manifold interpolation between latent keyframes. By relying on a local manifold smoothness assumption and appropriate interpolation schemes, we efficiently solve a bundle of frames using two or more latent codes. Additionally, the method is implemented as a sliding window optimization and requires only the first frame to be properly initialized, reducing the overall computational burden. BundleMoCap++’s strength lies in achieving high-quality motion capture results with fewer computational resources. To do this efficiently, we propose a novel human pose prior that focuses on the geometric aspect of the latent space, modeling it as a hypersphere, allowing for the introduction of sophisticated interpolation techniques. We also propose an algorithm for optimizing the latent variables directly on the learned manifold, improving convergence and performance. Finally, we introduce high-order interpolation techniques adapted for the hypersphere, allowing us to increase the solving temporal window, enhancing performance and efficiency.
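Interpolating latent keyframes on a hypersphere amounts to spherical linear interpolation (slerp) between latent codes; a minimal NumPy sketch follows, with the latent dimensionality chosen arbitrarily.

```python
import numpy as np

def slerp(z0, z1, alpha):
    """Spherical linear interpolation between two latent codes on the unit hypersphere.

    z0, z1: 1-D latent vectors; alpha in [0, 1]. Falls back to lerp when nearly parallel.
    """
    z0 = z0 / np.linalg.norm(z0)
    z1 = z1 / np.linalg.norm(z1)
    dot = np.clip(np.dot(z0, z1), -1.0, 1.0)
    theta = np.arccos(dot)                       # angle between the two codes
    if theta < 1e-6:                             # nearly identical directions
        return (1 - alpha) * z0 + alpha * z1
    return (np.sin((1 - alpha) * theta) * z0 + np.sin(alpha * theta) * z1) / np.sin(theta)

key0, key1 = np.random.randn(32), np.random.randn(32)       # two latent keyframes
frames = [slerp(key0, key1, a) for a in np.linspace(0, 1, 5)]  # intermediate latent frames
print(np.round(np.linalg.norm(frames[2]), 3))  # stays on the unit sphere (about 1.0)
```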
Citations: 0
A novel image inpainting method based on a modified Lengyel–Epstein model
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-03 | DOI: 10.1016/j.cviu.2024.104195
Jian Wang, Mengyu Luo, Xinlei Chen, Heming Xu, Junseok Kim
With the increasing popularity of digital images, developing advanced algorithms that can accurately reconstruct damaged images while maintaining high visual quality is crucial. Traditional image restoration algorithms often struggle with complex structures and details, while recent deep learning methods, though effective, face significant challenges related to high data dependency and computational costs. To address these challenges, we propose a novel image inpainting method based on a modified Lengyel–Epstein (LE) model. We discretize the modified LE model using an explicit Euler algorithm. A series of restoration experiments are conducted on various image types, including binary images, grayscale images, index images, and color images. The experimental results demonstrate the effectiveness and robustness of the method; even under complex conditions of noise interference and local damage, the proposed method exhibits excellent repair performance. To quantify the fidelity of the restored images, we use the peak signal-to-noise ratio (PSNR), a widely accepted metric in image processing. The calculated results further demonstrate the applicability of our model across different image types. Moreover, CPU-time measurements show that our method achieves ideal repair results within a remarkably brief duration. The proposed method demonstrates significant potential for real-world applications across diverse domains of image restoration and enhancement.
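An explicit-Euler step of a diffusion-type inpainting model, together with the PSNR metric used for evaluation, can be sketched as below. Plain isotropic diffusion stands in for the paper's modified Lengyel–Epstein reaction terms, which are not reproduced here; the step size and iteration count are illustrative.

```python
import numpy as np

def laplacian(u):
    """5-point stencil Laplacian with replicated borders."""
    up = np.pad(u, 1, mode="edge")
    return up[:-2, 1:-1] + up[2:, 1:-1] + up[1:-1, :-2] + up[1:-1, 2:] - 4 * u

def euler_inpaint(img, mask, steps=500, dt=0.1, D=1.0):
    """Explicit Euler inpainting sketch: diffuse only inside the damaged region (mask == 1).

    Plain isotropic diffusion is used as a stand-in; the paper couples fields with
    modified Lengyel-Epstein reaction terms, which are not reproduced here.
    """
    u = img.copy()
    for _ in range(steps):
        u = u + dt * D * laplacian(u) * mask     # update unknown pixels only
        u = np.clip(u, 0.0, 1.0)
    return u

def psnr(clean, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((clean - restored) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

clean = np.tile(np.linspace(0, 1, 64), (64, 1))          # synthetic gradient image
mask = np.zeros_like(clean); mask[24:40, 24:40] = 1.0    # damaged square
damaged = clean * (1 - mask)
print(round(psnr(clean, euler_inpaint(damaged, mask)), 2), "dB")
```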
Citations: 0
WGS-YOLO: A real-time object detector based on YOLO framework for autonomous driving
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-03 | DOI: 10.1016/j.cviu.2024.104200
Shiqin Yue, Ziyi Zhang, Ying Shi, Yonghua Cai
The safety and reliability of autonomous driving depend on the precision and efficiency of object detection systems. In this paper, a refined adaptation of the YOLO architecture (WGS-YOLO) is developed to improve the detection of pedestrians and vehicles. Specifically, its information fusion is enhanced by incorporating the Weighted Efficient Layer Aggregation Network (W-ELAN) module, an innovative dynamic weighted feature fusion module using channel shuffling. Meanwhile, the computational demands and parameters of the proposed WGS-YOLO are significantly reduced by employing the strategically designed Space-to-Depth Convolution (SPD-Conv) and Grouped Spatial Pyramid Pooling (GSPP) modules. The performance of our model is evaluated on the BDD100k and DAIR-V2X-V datasets. In terms of mean Average Precision (mAP0.5), the proposed model outperforms the baseline YOLOv7 by 12%. Furthermore, extensive experiments are conducted to verify our analysis and the model’s robustness across diverse scenarios.
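The Space-to-Depth Convolution mentioned above rearranges each 2x2 spatial block into channels and then applies a non-strided convolution; a PyTorch sketch of that building block follows, with channel sizes and the activation chosen as illustrative assumptions.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-Depth Convolution sketch: downsample by moving 2x2 blocks into channels,
    then mix them with a stride-1 convolution instead of a strided conv or pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.space_to_depth = nn.PixelUnshuffle(2)          # (C, H, W) -> (4C, H/2, W/2)
        self.conv = nn.Sequential(
            nn.Conv2d(4 * in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(self.space_to_depth(x))

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```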
Citations: 0
Found missing semantics: Supplemental prototype network for few-shot semantic segmentation
IF 4.3 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-03 | DOI: 10.1016/j.cviu.2024.104191
Chen Liang, Shuang Bai
Few-shot semantic segmentation alleviates the problem of massive data requirements and high costs in semantic segmentation tasks. By learning from a support set, few-shot semantic segmentation can segment new classes. However, existing few-shot semantic segmentation methods suffer from information loss during the process of mask average pooling. To address this problem, we propose a supplemental prototype network (SPNet). The SPNet aggregates the lost information from global prototypes to create a supplemental prototype, which enhances the segmentation performance for the current class. In addition, we utilize mutual attention to enhance the similarity between the support and query feature maps, allowing the model to better identify the target to be segmented. Finally, we introduce a self-correcting auxiliary, which utilizes the data more effectively to improve segmentation accuracy. We conducted extensive experiments on PASCAL-5i and COCO-20i, which demonstrated the effectiveness of SPNet, and our method achieved state-of-the-art results in the 1-shot and 5-shot semantic segmentation settings.
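Mask average pooling, the step the abstract identifies as lossy, and the subsequent prototype-to-query matching can be sketched as follows; feature shapes and the cosine-similarity scoring are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Collapse support features into a class prototype using the support mask.

    feat: (B, C, H, W) support features; mask: (B, 1, h, w) binary foreground mask.
    Spatial detail inside the mask is averaged away, which is the information loss
    a supplemental prototype is meant to recover.
    """
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    proto = (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)   # (B, C)
    return proto

def match_query(query_feat, proto):
    """Cosine similarity between every query location and the prototype gives a coarse score map."""
    proto = proto[:, :, None, None].expand_as(query_feat)            # (B, C, H, W)
    return F.cosine_similarity(query_feat, proto, dim=1)             # (B, H, W)

support_feat = torch.randn(1, 256, 60, 60)
support_mask = (torch.rand(1, 1, 473, 473) > 0.5).float()            # hypothetical binary mask
query_feat = torch.randn(1, 256, 60, 60)
score = match_query(query_feat, masked_average_pooling(support_feat, support_mask))
print(score.shape)  # torch.Size([1, 60, 60])
```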
Citations: 0