
Latest Publications from the International Journal of Computer Vision

Warping the Residuals for Image Editing with StyleGAN
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-11-18 DOI: 10.1007/s11263-024-02301-6
Ahmet Burak Yildirim, Hamza Pehlivan, Aysegul Dundar

StyleGAN models show editing capabilities via their semantically interpretable latent organizations, which require successful GAN inversion methods to edit real images. Many works have been proposed for inverting images into StyleGAN’s latent space. However, their results either suffer from low fidelity to the input image or poor editing quality, especially for edits that require large transformations. That is because low bit rate latent spaces lose many image details due to the information bottleneck, even though they provide an editable space. On the other hand, higher bit rate latent spaces can pass all the image details to StyleGAN for perfect reconstruction of images but suffer from low editing quality. In this work, we present a novel image inversion architecture that extracts high-rate latent features and includes a flow estimation module to warp these features to adapt them to edits. This is because edits often involve spatial changes in the image, such as adjustments to pose or smile. Thus, high-rate latent features must be accurately repositioned to match their new locations in the edited image space. We achieve this by employing flow estimation to determine the necessary spatial adjustments, followed by warping the features to align them correctly in the edited image. Specifically, we estimate the flows from StyleGAN features of edited and unedited latent codes. By estimating the high-rate features and warping them for edits, we achieve both high fidelity to the input image and high-quality edits. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.
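As a rough illustration of the mechanism described above (estimating a flow field from the StyleGAN features of the unedited and edited latent codes, then warping the high-rate residual features), the following PyTorch sketch shows one minimal way such a warp could be wired up. The module and tensor names (FlowHead, feats_orig, feats_edit, residuals) are hypothetical; this is not the authors' implementation.

```python
# Minimal sketch, not the paper's code: predict a flow from edited vs. unedited
# generator features, then bilinearly warp the high-rate residual features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowHead(nn.Module):
    """Predicts a dense 2-channel flow from concatenated feature maps."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 2, 3, padding=1),  # (dx, dy) per pixel
        )

    def forward(self, x):
        return self.net(x)

def warp(features, flow):
    """Warp `features` (N,C,H,W) by `flow` (N,2,H,W) given in pixel units."""
    n, _, h, w = features.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(features.device)  # (2,H,W)
    coords = base.unsqueeze(0) + flow                                # absolute sample positions
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                    # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                 # (N,H,W,2) for grid_sample
    return F.grid_sample(features, grid, mode="bilinear", align_corners=True)

# feats_orig / feats_edit would come from StyleGAN layers for the unedited and
# edited latent codes; residuals are the high-rate encoder features (all toy here).
feats_orig = torch.randn(1, 512, 64, 64)
feats_edit = torch.randn(1, 512, 64, 64)
residuals  = torch.randn(1, 512, 64, 64)
flow = FlowHead(1024)(torch.cat([feats_orig, feats_edit], dim=1))
warped_residuals = warp(residuals, flow)   # would be fed back into the generator
```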

Citations: 0
Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-11-16 DOI: 10.1007/s11263-024-02285-3
Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Liwei Wu, Yuxi Wang, Zhaoxiang Zhang

Domain-adaptive semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. However, existing methods primarily focus on directly learning categorically discriminative target features for segmenting target images, which is challenging in the absence of target labels. This work provides a new perspective. We observe that the features learned with source data manage to keep categorically discriminative during training, thereby enabling us to implicitly learn adequate target representations by simply pulling target features close to source features for each category. To this end, we propose T2S-DA, which encourages the model to learn similar cross-domain features. Also, considering the pixel categories are heavily imbalanced for segmentation datasets, we come up with a dynamic re-weighting strategy to help the model concentrate on those underperforming classes. Extensive experiments confirm that T2S-DA learns a more discriminative and generalizable representation, significantly surpassing the state-of-the-art. We further show that T2S-DA is quite qualified for the domain generalization task, verifying its domain-invariant property.
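The core idea of pulling target features toward per-category source features can be sketched as a prototype-based contrastive objective, as in the minimal PyTorch example below. The function name, the pseudo-label source, and the class_weight placeholder (standing in for the paper's dynamic re-weighting) are assumptions for illustration, not the released T2S-DA code.

```python
# Sketch: source-class prototypes act as positives for target pixels of the
# (pseudo-)same class; prototypes of other classes act as negatives.
import torch
import torch.nn.functional as F

def pull_target_to_source(src_feat, src_label, tgt_feat, tgt_pseudo,
                          num_classes, tau=0.1, class_weight=None):
    # src_feat/tgt_feat: (N, D) pixel embeddings; src_label/tgt_pseudo: (N,) class ids
    protos = torch.zeros(num_classes, src_feat.size(1), device=src_feat.device)
    for c in range(num_classes):
        mask = src_label == c
        if mask.any():
            protos[c] = src_feat[mask].mean(dim=0)   # per-class source prototype
    protos = F.normalize(protos, dim=1)
    tgt = F.normalize(tgt_feat, dim=1)
    logits = tgt @ protos.t() / tau                  # cosine similarity to every prototype
    # class_weight is a stand-in for the paper's dynamic re-weighting of rare classes
    return F.cross_entropy(logits, tgt_pseudo, weight=class_weight)

# toy usage with 19 classes and random features / pseudo-labels
loss = pull_target_to_source(
    torch.randn(1024, 256), torch.randint(0, 19, (1024,)),
    torch.randn(1024, 256), torch.randint(0, 19, (1024,)), num_classes=19)
```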

Citations: 0
Feature Matching via Graph Clustering with Local Affine Consensus
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2024-11-15 DOI: 10.1007/s11263-024-02291-5
Yifan Lu, Jiayi Ma

This paper studies graph clustering with application to feature matching and proposes an effective method, termed GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges, where a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometrically meaningful graph, based on the best practices, we adopt a local affine strategy. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachable on the graph by replicator dynamics optimization. Extensive experiments on both the local components and the whole of our GC-LAC, with various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multimodel fitting, demonstrate that our GC-LAC is more competitive than current state-of-the-art methods in terms of generality, efficiency, and effectiveness. The source code for this work is publicly available at: https://github.com/YifanLu2000/GCLAC.
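To make the graph-clustering view concrete, the toy example below builds a node per putative match, connects matches whose displacements agree within a local neighbourhood, and reads visual patterns off the connected components. The paper's local affine solver (MCDG) and replicator-dynamics clustering (D2SCAN) are replaced here by a plain displacement test and connected components purely for illustration; all names and thresholds are hypothetical.

```python
# Sketch: matches -> graph nodes, motion-consistent neighbours -> edges,
# graph clusters -> coherent visual patterns.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def cluster_matches(pts_src, pts_dst, k=8, tol=10.0):
    """pts_src, pts_dst: (N, 2) matched keypoint coordinates in the two images."""
    n = len(pts_src)
    disp = pts_dst - pts_src                                   # motion of each match
    d2 = ((pts_src[:, None, :] - pts_src[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]                    # k nearest neighbours in the source image
    rows, cols = [], []
    for i in range(n):
        for j in nn[i]:
            if np.linalg.norm(disp[i] - disp[j]) < tol:        # displacements agree within `tol` pixels
                rows.append(i); cols.append(j)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    n_clusters, labels = connected_components(adj, directed=False)
    return labels                                               # cluster id per match; outliers fall into tiny clusters

labels = cluster_matches(np.random.rand(200, 2) * 640, np.random.rand(200, 2) * 640)
```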

Citations: 0
CCR: Facial Image Editing with Continuity, Consistency and Reversibility
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-14 DOI: 10.1007/s11263-023-01938-z
Nan Yang, Xin Luan, Huidi Jia, Zhi Han, Xiaofeng Li, Yandong Tang

Three problems exist in sequential facial image editing: discontinuous editing, inconsistent editing, and irreversible editing. Discontinuous editing means that the current edit cannot retain previously edited attributes. Inconsistent editing means that swapping the order of attribute edits does not yield the same results. Irreversible editing means that operating on a facial image is irreversible, especially in sequential facial image editing. In this work, we put forward three concepts and their corresponding definitions: editing continuity, consistency, and reversibility. Note that continuity refers to the continuity of attributes, that is, attributes can be continuously edited on any face. Consistency requires not only that attributes meet continuity, but also that facial identity remains consistent. To this end, we propose a novel model that achieves editing continuity, consistency, and reversibility. Furthermore, a sufficient criterion is defined to determine whether a model is continuous, consistent, and reversible. Extensive qualitative and quantitative experimental results validate our proposed model, and show that a continuous, consistent and reversible editing model has a more flexible editing function while preserving facial identity. We believe that our proposed definitions and model will have wide and promising applications in multimedia processing. Code and data are available at https://github.com/mickoluan/CCR.
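The reversibility and consistency properties can be probed empirically for any attribute editor with simple cycle checks, as in the sketch below. The edit function signature and the tolerance are hypothetical placeholders, not the CCR model itself.

```python
# Sketch: empirical checks of reversibility (edit then undo recovers the input)
# and consistency (edit order does not change the result) for a generic editor.
import torch

def is_reversible(edit, img, attr, strength=1.0, tol=1e-2):
    # edit by +strength, then by -strength; the result should match the input
    cycled = edit(edit(img, attr, strength), attr, -strength)
    return torch.mean((cycled - img) ** 2).item() < tol

def is_consistent(edit, img, attr_a, attr_b, tol=1e-2):
    # applying two attribute edits in either order should give the same image
    ab = edit(edit(img, attr_a, 1.0), attr_b, 1.0)
    ba = edit(edit(img, attr_b, 1.0), attr_a, 1.0)
    return torch.mean((ab - ba) ** 2).item() < tol

# toy usage with an identity "editor" that trivially satisfies both checks
dummy_edit = lambda img, attr, s: img
x = torch.rand(1, 3, 256, 256)
print(is_reversible(dummy_edit, x, "smile"), is_consistent(dummy_edit, x, "smile", "age"))
```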

Citations: 0
Learning Robust Multi-scale Representation for Neural Radiance Fields from Unposed Images
CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-11 DOI: 10.1007/s11263-023-01936-1
Nishant Jain, Suryansh Kumar, Luc Van Gool
{"title":"Learning Robust Multi-scale Representation for Neural Radiance Fields from Unposed Images","authors":"Nishant Jain, Suryansh Kumar, Luc Van Gool","doi":"10.1007/s11263-023-01936-1","DOIUrl":"https://doi.org/10.1007/s11263-023-01936-1","url":null,"abstract":"","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"15 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135042700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Going Deeper into Recognizing Actions in Dark Environments: A Comprehensive Benchmark Study
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-08 DOI: 10.1007/s11263-023-01932-5
Yuecong Xu, Haozhi Cao, Jianxiong Yin, Zhenghua Chen, Xiaoli Li, Zhengguo Li, Qianwen Xu, Jianfei Yang

While action recognition (AR) has gained large improvements with the introduction of large-scale video datasets and the development of deep neural networks, AR models robust to challenging environments in real-world scenarios are still under-explored. We focus on the task of action recognition in dark environments, which can be applied to fields such as surveillance and autonomous driving at night. Intuitively, current deep networks along with visual enhancement techniques should be able to handle AR in dark environments; however, it is observed that this is not always the case in practice. To dive deeper into exploring solutions for AR in dark environments, we launched the UG2+ Challenge Track 2 (UG2-2) in IEEE CVPR 2021, with a goal of evaluating and advancing the robustness of AR models in dark environments. The challenge builds and expands on top of a novel ARID dataset, the first dataset for the task of dark video AR, and guides models to tackle such a task in both fully and semi-supervised manners. Baseline results utilizing current AR models and enhancement methods are reported, justifying the challenging nature of this task with substantial room for improvements. Thanks to the active participation from the research community, notable advances have been made in participants’ solutions, while analysis of these solutions helped better identify possible directions to tackle the challenge of AR in dark environments.
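As a concrete example of the kind of visual enhancement baseline mentioned above, a common low-light preprocessing step is plain gamma correction of each frame, sketched below. This is a generic baseline for illustration, not a method from the challenge itself.

```python
# Sketch: brighten a dark uint8 frame by gamma correction before feeding it to an AR model.
import numpy as np

def gamma_correct(frame_uint8, gamma=0.4):
    """Normalize to [0, 1], raise to gamma < 1 to lift dark values, rescale to uint8."""
    x = frame_uint8.astype(np.float32) / 255.0
    return np.clip((x ** gamma) * 255.0, 0, 255).astype(np.uint8)

bright = gamma_correct(np.random.randint(0, 40, (224, 224, 3), dtype=np.uint8))
```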

Citations: 0
Harmonizing Base and Novel Classes: A Class-Contrastive Approach for Generalized Few-Shot Segmentation
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-08 DOI: 10.1007/s11263-023-01939-y
Weide Liu, Zhonghua Wu, Yang Zhao, Yuming Fang, Chuan-Sheng Foo, Jun Cheng, Guosheng Lin

Current methods for few-shot segmentation (FSSeg) have mainly focused on improving the performance of novel classes while neglecting the performance of base classes. To overcome this limitation, the task of generalized few-shot semantic segmentation (GFSSeg) has been introduced, aiming to predict segmentation masks for both base and novel classes. However, the current prototype-based methods do not explicitly consider the relationship between base and novel classes when updating prototypes, leading to limited performance in identifying true categories. To address this challenge, we propose a class contrastive loss and a class relationship loss to regulate prototype updates and encourage a large distance between prototypes from different classes, thus distinguishing the classes from each other while maintaining the performance of the base classes. Our proposed approach achieves new state-of-the-art performance for the generalized few-shot segmentation task on the PASCAL VOC and MS COCO datasets.
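One minimal way to encourage a large distance between prototypes of different classes is a margin penalty on pairwise prototype similarity, sketched below. This illustrates the general idea only; it is not the paper's exact class contrastive or class relationship loss, and the margin value is an assumption.

```python
# Sketch: push prototypes of different (base and novel) classes apart in cosine space.
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(prototypes, margin=0.5):
    # prototypes: (C, D), one embedding per class
    p = F.normalize(prototypes, dim=1)
    sim = p @ p.t()                                       # cosine similarity matrix (C, C)
    off_diag = sim - torch.eye(len(p), device=p.device)   # zero out the diagonal (self-similarity)
    # penalize pairs of different classes whose similarity exceeds 1 - margin
    return F.relu(off_diag - (1.0 - margin)).mean()

loss = prototype_contrastive_loss(torch.randn(21, 256))   # e.g., 21 classes on PASCAL VOC
```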

Citations: 0
Universal Object Detection with Large Vision Model
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-07 DOI: 10.1007/s11263-023-01929-0
Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu, Xiaoyu Wang

Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a prestigious second-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
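Two of the ingredients mentioned above, mapping per-dataset labels into a unified space and a hierarchy-aware loss, can be illustrated with the toy snippet below. The taxonomy, the mapping table, and the multi-hot ancestor targets are hypothetical examples written from the abstract, not the released code.

```python
# Sketch: unify labels across datasets and mark a class plus all its ancestors
# as positive in a multi-label, hierarchy-aware target.
import torch

unified_classes = ["vehicle", "car", "truck", "animal", "dog"]          # toy taxonomy
parent = {"car": "vehicle", "truck": "vehicle", "dog": "animal"}        # child -> parent
dataset_to_unified = {("coco", "car"): "car", ("oid", "Dog"): "dog"}    # per-dataset label mapping

def hierarchy_target(class_name):
    """Multi-hot target over unified_classes with the class and all ancestors set to 1."""
    target = torch.zeros(len(unified_classes))
    while class_name is not None:
        target[unified_classes.index(class_name)] = 1.0
        class_name = parent.get(class_name)
    return target

def hierarchy_aware_loss(logits, class_name):
    # logits: (num_unified_classes,) raw scores for one detected box
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, hierarchy_target(class_name))

loss = hierarchy_aware_loss(torch.randn(len(unified_classes)), dataset_to_unified[("oid", "Dog")])
```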

Citations: 0
Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-06 DOI: 10.1007/s11263-023-01935-2
Yaokun Li, Guang Tan, Chao Gou

Landmark detection under large pose with occlusion has been one of the challenging problems in the field of facial analysis. Recently, many works have predicted pose or occlusion together in the multi-task learning (MTL) paradigm, trying to tap into their dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., accommodating the induced explicit effects. This is one of the essentials that hinders their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmark, occlusion probability, and pose. The proposed CIT, besides implicitly mining task dependencies in a shared encoder, innovatively employs a cost-effective and portability-friendly strategy that passes the decoders’ predictions as prior knowledge, exploiting the coupling-induced effects in a human-like manner. Moreover, to the best of our knowledge, no dataset contains all these task annotations simultaneously, so we introduce a new dataset termed MERL-RAV-FLOP based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results. The code and dataset can be accessed at https://github.com/Iron-LYK/CIT.
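The cascaded-iterative idea of feeding the decoders' previous predictions back as an explicit prior can be sketched as below, with plain linear heads standing in for the transformer decoders. All module names and dimensions are hypothetical; this is an illustration of the strategy, not the CIT architecture.

```python
# Sketch: three heads (landmarks, per-landmark occlusion, head pose) refined over
# a few iterations, each time conditioning on the previous predictions as a prior.
import torch
import torch.nn as nn

class CascadedHeads(nn.Module):
    def __init__(self, feat_dim=256, n_landmarks=68, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        prior_dim = n_landmarks * 2 + n_landmarks + 3          # landmarks + occlusion + pose
        self.landmark_head = nn.Linear(feat_dim + prior_dim, n_landmarks * 2)
        self.occlusion_head = nn.Linear(feat_dim + prior_dim, n_landmarks)
        self.pose_head = nn.Linear(feat_dim + prior_dim, 3)    # yaw, pitch, roll

    def forward(self, feat):
        b = feat.size(0)
        lmk = torch.zeros(b, self.landmark_head.out_features, device=feat.device)
        occ = torch.zeros(b, self.occlusion_head.out_features, device=feat.device)
        pose = torch.zeros(b, 3, device=feat.device)
        for _ in range(self.n_iters):
            prior = torch.cat([lmk, occ, pose], dim=1)         # previous predictions as explicit prior
            x = torch.cat([feat, prior], dim=1)
            lmk = self.landmark_head(x)
            occ = torch.sigmoid(self.occlusion_head(x))        # occlusion probability per landmark
            pose = self.pose_head(x)
        return lmk, occ, pose

lmk, occ, pose = CascadedHeads()(torch.randn(2, 256))
```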

Citations: 0
Local Compressed Video Stream Learning for Generic Event Boundary Detection
IF 19.5 CAS Tier 2 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2023-11-01 DOI: 10.1007/s11263-023-01921-8
Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan

Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before feeding into the network, which contains significant spatio-temporal redundancy and demands considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end, leveraging rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct the local frames bag for each candidate frame and use the long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is only applied within a local window, which is critical for event boundary detection. Finally, a simple classifier is used to determine the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements compared to previous end-to-end approaches while running at the same speed. The code is available at https://github.com/GX77/LCVSL.
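The Gaussian preprocessing of ground-truth boundaries mentioned above can be illustrated as turning binary boundary annotations into soft targets, as in the short NumPy sketch below; the kernel width is an assumed value.

```python
# Sketch: spread each annotated boundary frame into a soft target with a Gaussian,
# so frames near a boundary also receive partial supervision.
import numpy as np

def soften_boundaries(boundary_frames, num_frames, sigma=1.0):
    """boundary_frames: annotated boundary frame indices -> per-frame soft target in [0, 1]."""
    t = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundary_frames:
        target = np.maximum(target, np.exp(-0.5 * ((t - b) / sigma) ** 2))
    return target

soft = soften_boundaries([12, 40, 41], num_frames=64)   # peaks at the annotated boundaries
```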

Citations: 0