Video Instance Segmentation in an Open-World
Pub Date: 2024-07-30 | DOI: 10.1007/s11263-024-02195-4
Omkar Thawakar, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan
Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only seen category instances are identified and spatio-temporally segmented at inference. The open-world formulation relaxes the closed-world static-learning assumption as follows: (a) first, it distinguishes a set of known categories and labels any unknown object as ‘unknown’, and then (b) it incrementally learns the class of an unknown as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, which introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism, based on a light-weight auxiliary network, aims at accurate pixel-level (unknown) object delineation from the background as well as distinguishing category-specific known semantic classes. The STO module strives to generate instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. Moreover, we introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6% AP on the YouTube-VIS 2019 val. set. Lastly, we show the generalizability of our contributions in the open-world object detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code, models, and OW-VIS splits are available at https://github.com/OmkarThawakar/OWVISFormer.
Equiangular Basis Vectors: A Novel Paradigm for Classification Tasks
Pub Date: 2024-07-30 | DOI: 10.1007/s11263-024-02189-2
Yang Shen, Xuhao Sun, Xiu-Shen Wei, Anqi Xu, Lingyan Gao
In this paper, we propose Equiangular Basis Vectors (EBVs) as a novel training paradigm of deep learning for image classification tasks. Differing from prominent training paradigms, e.g., k-way classification layers (mapping the learned representations to the label space) and deep metric learning (quantifying sample similarity), our method generates normalized vector embeddings as "predefined classifiers", which act as the fixed learning targets corresponding to different categories. By minimizing the spherical distance between the embedding of an input and its categorical EBV during training, predictions can be obtained at inference by identifying the categorical EBV with the smallest distance. More importantly, by directly adding EBVs of equal status for newly added categories on the basis of existing EBVs, our method exhibits strong scalability to deal with the large increase of training categories in open-environment machine learning. In experiments, we evaluate EBVs on diverse computer vision tasks with large-scale real-world datasets, including classification on ImageNet-1K, object detection on COCO, semantic segmentation on ADE20K, etc. We further collected a dataset consisting of 100,000 categories to validate the superior performance of EBVs when handling a large number of categories. Comprehensive experiments validate both the effectiveness and scalability of our EBVs. Our method won first place in the 2022 DIGIX Global AI Challenge; the code, along with all associated logs, is open-source and available at https://github.com/aassxun/Equiangular-Basis-Vectors.
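The inference rule stated in the abstract (nearest fixed, normalized class vector on the sphere) can be illustrated with a minimal sketch. The construction of near-equiangular vectors via a small repulsion optimization, the 1-minus-cosine loss, and all function names below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def make_basis_vectors(num_classes: int, dim: int, steps: int = 500, tau: float = 0.07) -> torch.Tensor:
    """Crude stand-in for EBV construction: push unit vectors apart so their
    pairwise cosine similarities shrink toward the equiangular regime."""
    v = torch.randn(num_classes, dim, requires_grad=True)
    opt = torch.optim.Adam([v], lr=1e-2)
    for _ in range(steps):
        u = F.normalize(v, dim=1)
        sim = u @ u.t() - torch.eye(num_classes)   # off-diagonal cosine similarities
        loss = (sim / tau).exp().sum()             # penalize large pairwise similarities
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(v.detach(), dim=1)          # fixed targets, never trained afterwards

def ebv_loss(embeddings: torch.Tensor, labels: torch.Tensor, ebvs: torch.Tensor) -> torch.Tensor:
    """Pull each normalized embedding toward the EBV of its class (1 - cosine, monotone in spherical distance)."""
    z = F.normalize(embeddings, dim=1)
    return (1.0 - (z * ebvs[labels]).sum(dim=1)).mean()

def ebv_predict(embeddings: torch.Tensor, ebvs: torch.Tensor) -> torch.Tensor:
    """Predict the class whose EBV is closest on the sphere (largest cosine similarity)."""
    z = F.normalize(embeddings, dim=1)
    return (z @ ebvs.t()).argmax(dim=1)
```

In such a setup only the backbone would be trained with `ebv_loss` against the frozen vectors, with `ebv_predict` used at test time; adding a new class amounts to appending one more fixed vector.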
Face3DAdv: Exploiting Robust Adversarial 3D Patches on Physical Face Recognition
Pub Date: 2024-07-28 | DOI: 10.1007/s11263-024-02177-6
Xiao Yang, Longlong Xu, Tianyu Pang, Yinpeng Dong, Yikai Wang, Hang Su, Jun Zhu
Recent research has elucidated the susceptibility of face recognition models to physical adversarial patches, provoking security concerns about deployed face recognition systems. Most existing 2D and 3D physical attacks on face recognition, however, produce adversarial examples using a single-state face image of an attacker. This point-wise attack paradigm tends to yield inferior results when countering the numerous complicated states in physical environments, such as diverse pose variations. In this paper, by reassessing the intrinsic relationship between an attacker’s face and its variations, we propose a practical pipeline that simulates complex facial transformations in the physical world through 3D face modeling. This adaptive simulation serves as a digital counterpart of physical faces and empowers us to regulate various facial variations and physical conditions. With this digital simulator, we present the Face3DAdv method to craft 3D adversarial patches, which account for 3D facial transformations and realistic physical variations. Moreover, by optimizing the latent space of the 3D model and applying importance sampling over various transformations, we demonstrate that Face3DAdv can significantly improve the effectiveness and naturalness of a wide range of physically feasible adversarial patches. Furthermore, adversarial patches physically 3D-printed from Face3DAdv enable an effective evaluation of adversarial robustness on multiple popular commercial services, including four recognition APIs, three anti-spoofing APIs and one automated access control system.
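The pipeline described here (optimizing a patch in a 3D face model's latent space under importance-sampled physical transformations) can be summarized with a hedged sketch. `render_face`, `face_recog`, and `sample_transform` are hypothetical placeholders for a differentiable renderer, a face embedding network, and a transformation sampler, and the cosine-similarity impersonation objective is an assumed loss, not the paper's exact formulation.

```python
import torch

def optimize_patch_latent(face_recog, render_face, sample_transform, target_feat,
                          latent_dim=128, steps=200, num_samples=8, lr=0.05):
    """Optimize a latent code for a 3D patch so the rendered, transformed face matches
    a target identity on average over sampled physical transformations (sketch only)."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        losses, weights = [], []
        for _ in range(num_samples):
            params, weight = sample_transform()     # pose/lighting parameters + importance weight
            img = render_face(z, params)            # differentiable rendering of the patched face
            feat = face_recog(img)
            losses.append(1.0 - torch.cosine_similarity(feat, target_feat, dim=-1))
            weights.append(weight)
        w = torch.tensor(weights)
        loss = (w * torch.stack(losses)).sum() / w.sum()   # importance-weighted expectation over states
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```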
Kill Two Birds with One Stone: Domain Generalization for Semantic Segmentation via Network Pruning
Pub Date: 2024-07-27 | DOI: 10.1007/s11263-024-02194-5
Yawei Luo, Ping Liu, Yi Yang
Deep models notoriously perform poorly when encountering new domains with different statistics. To alleviate this issue, we present a new domain generalization method based on network pruning, dubbed NPDG. Our core idea is to prune the filters or attention heads that are more sensitive to domain shift while preserving the domain-invariant ones. To this end, we propose a new pruning policy tailored to improve generalization ability, which identifies the sensitivity of each filter and head to domain shift by judging its activation variance across different domains (unary manner) and its correlation to other filters (binary manner). To better reveal those potentially sensitive filters and heads, we present a differentiable style perturbation scheme to imitate the domain variance dynamically. NPDG is trained on a single source domain and can be applied to both CNN- and Transformer-based backbones. To our knowledge, we are among the pioneers in tackling domain generalization in segmentation via network pruning. NPDG not only improves the generalization ability of a segmentation model but also decreases its computation cost. Extensive experiments demonstrate the state-of-the-art generalization performance of NPDG with a lighter-weight structure.
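A minimal sketch of the unary sensitivity criterion follows, assuming per-filter mean activations are collected from several style-perturbed copies of the same batch; the binary (inter-filter correlation) term and the differentiable style perturbation are omitted, and all names are illustrative rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def filter_sensitivity(feature_maps):
    """Unary criterion sketch: per-filter variance of the mean activation across
    style-perturbed 'domains'. feature_maps: list of (N, C, H, W) tensors, one per domain."""
    per_domain = torch.stack([f.mean(dim=(0, 2, 3)) for f in feature_maps])  # (D, C)
    return per_domain.var(dim=0)                                             # (C,), higher = more domain-sensitive

@torch.no_grad()
def prune_mask(sensitivity, prune_ratio=0.3):
    """Keep the most domain-invariant filters; zero out the most domain-sensitive ones."""
    k = int(prune_ratio * sensitivity.numel())
    idx = sensitivity.topk(k).indices
    mask = torch.ones_like(sensitivity)
    mask[idx] = 0.0
    return mask   # multiply onto the corresponding conv weights or outputs to prune
```

In practice the feature maps would come from forward hooks on the single source domain under different style perturbations, and the mask would be applied per layer before fine-tuning.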
Learning Rate Curriculum
Pub Date: 2024-07-27 | DOI: 10.1007/s11263-024-02186-5
Florinel-Alin Croitoru, Nicolae-Cătălin Ristea, Radu Tudor Ionescu, Nicu Sebe
Most curriculum learning methods require an approach to sort the data samples by difficulty, which is often cumbersome to perform. In this work, we propose a novel curriculum learning approach termed Learning Rate Curriculum (LeRaC), which leverages a different learning rate for each layer of a neural network to create a data-agnostic curriculum during the initial training epochs. More specifically, LeRaC assigns higher learning rates to neural layers closer to the input, gradually decreasing the learning rates as the layers are placed farther away from the input. The learning rates increase at various paces during the first training iterations, until they all reach the same value. From this point on, the neural model is trained as usual. This creates a model-level curriculum learning strategy that does not require sorting the examples by difficulty and is compatible with any neural network, generating higher performance levels regardless of the architecture. We conduct comprehensive experiments on 12 data sets from the computer vision (CIFAR-10, CIFAR-100, Tiny ImageNet, ImageNet-1K, Food-101, UTKFace, PASCAL VOC), language (BoolQ, QNLI, RTE) and audio (ESC-50, CREMA-D) domains, considering various convolutional (ResNet-18, Wide-ResNet-50, DenseNet-121, YOLOv5), recurrent (LSTM) and transformer (CvT, BERT, SepTr) architectures. We compare our approach with the conventional training regime, as well as with Curriculum by Smoothing (CBS), a state-of-the-art data-agnostic curriculum learning approach. Unlike CBS, our performance improvements over the standard training regime are consistent across all data sets and models. Furthermore, we significantly surpass CBS in terms of training time (LeRaC incurs no additional cost over the standard training regime). Our code is freely available at: https://github.com/CroitoruAlin/LeRaC.
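Because the abstract describes the schedule concretely (layer-dependent initial learning rates that all rise to a common value during the first iterations), a small sketch is given below; the geometric decay over layer depth and the log-linear warm-up are assumed concrete choices, not necessarily the paper's exact schedule.

```python
import torch

def lerac_param_groups(model, base_lr=1e-3, decay=10.0):
    """One optimizer group per top-level layer: the farther a layer is from the input,
    the smaller its initial learning rate (layer k starts at base_lr / decay**k)."""
    groups = []
    for k, module in enumerate(model.children()):
        params = [p for p in module.parameters() if p.requires_grad]
        if params:
            lr0 = base_lr / decay ** k
            groups.append({"params": params, "lr": lr0, "initial_lr": lr0})
    return groups

def lerac_step(optimizer, iteration, warmup_iters, base_lr=1e-3):
    """Raise every group's LR toward the shared base_lr during the first warmup_iters
    iterations (geometric interpolation); afterwards all groups train at base_lr."""
    t = min(iteration / warmup_iters, 1.0)
    for g in optimizer.param_groups:
        start = g["initial_lr"]
        g["lr"] = start * (base_lr / start) ** t
```

A typical use would be `opt = torch.optim.SGD(lerac_param_groups(model), lr=1e-3)` followed by `lerac_step(opt, it, warmup_iters)` before each update, after which training proceeds as usual.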
Position-Guided Point Cloud Panoptic Segmentation Transformer
Pub Date: 2024-07-27 | DOI: 10.1007/s11263-024-02162-z
Zeqi Xiao, Wenwei Zhang, Tai Wang, Chen Change Loy, Dahua Lin, Jiangmiao Pang
DEtection TRansformer (DETR) started a trend of using a group of learnable queries for unified visual perception. This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline. Although the naive adaptation obtains fair results, the instance segmentation performance is noticeably inferior to previous works. By diving into the details, we observe that instances in sparse point clouds are relatively small compared to the whole scene and often have similar geometry while lacking the distinctive appearance cues useful for segmentation, conditions that are rare in the image domain. Considering that instances in 3D are characterized more by their positional information, we emphasize their roles during the modeling and design a robust Mixed-parameterized Positional Embedding (MPE) to guide the segmentation process. It is embedded into backbone features and later guides the mask prediction and query update processes iteratively, leading to Position-Aware Segmentation (PA-Seg) and Masked Focal Attention (MFA). All these designs impel the queries to attend to specific regions and identify various instances. The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 2.7% and 1.2% PQ on the SemanticKITTI and nuScenes datasets, respectively. The source code and models are available at https://github.com/OpenRobotLab/P3Former.
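A hedged sketch of a mixed-parameterized positional embedding follows, assuming "mixed" refers to combining Cartesian coordinates with a polar (range/azimuth) parameterization of each LiDAR point; this interpretation and the two-branch MLP layout are assumptions for illustration, not the paper's verified design.

```python
import torch
import torch.nn as nn

class MixedPositionalEmbedding(nn.Module):
    """Embed each point from both Cartesian (x, y, z) and polar (range, azimuth)
    parameterizations and sum the two branches (illustrative sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.cart = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.polar = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, xyz):                               # xyz: (N, 3) point coordinates
        r = xyz[:, :2].norm(dim=1, keepdim=True)          # range in the ground plane
        theta = torch.atan2(xyz[:, 1:2], xyz[:, 0:1])     # azimuth angle
        return self.cart(xyz) + self.polar(torch.cat([r, theta], dim=1))
```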
Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-label Classification
Pub Date: 2024-07-26 | DOI: 10.1007/s11263-024-02157-w
Zitai Wang, Qianqian Xu, Zhiyong Yang, Peisong Wen, Yuan He, Xiaochun Cao, Qingming Huang
Multi-label ranking, which returns multiple top-ranked labels for each instance, has a wide range of applications in visual tasks. Due to its complicated setting, prior works have proposed various measures to evaluate model performance. However, both theoretical analysis and empirical observations show that a model might perform inconsistently across different measures. To bridge this gap, this paper proposes a novel measure named Top-K Pairwise Ranking (TKPR), and a series of analyses shows that TKPR is compatible with existing ranking-based measures. In light of this, we further establish an empirical surrogate risk minimization framework for TKPR. On one hand, the proposed framework enjoys convex surrogate losses with the theoretical support of Fisher consistency. On the other hand, we establish a sharp generalization bound for the proposed framework based on a novel technique named data-dependent contraction. Finally, empirical results on benchmark datasets validate the effectiveness of the proposed framework.
Knowledge Distillation Meets Open-Set Semi-supervised Learning
Pub Date: 2024-07-26 | DOI: 10.1007/s11263-024-02192-7
Jing Yang, Xiatian Zhu, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Existing knowledge distillation methods mostly focus on distillation of the teacher’s prediction and intermediate activations. However, the structured representation, which arguably is one of the most critical ingredients of deep models, is largely overlooked. In this work, we propose a novel semantic representational distillation (SRD) method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student. The key idea is to leverage the teacher’s classifier as a semantic critic for evaluating the representations of both teacher and student and distilling the semantic knowledge with high-order structured information over all feature dimensions. This is accomplished by introducing a notion of cross-network logit, computed by passing the student’s representation into the teacher’s classifier. Further, considering the set of seen classes as a basis for the semantic space in a combinatorial perspective, we scale SRD to unseen classes, enabling effective exploitation of largely available, arbitrary unlabeled training data. At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL). Extensive experiments show that our SRD significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks, as well as on the less studied yet practically crucial task of binary network distillation. Under the more realistic open-set SSL settings we introduce, we reveal that knowledge distillation is generally more effective than existing out-of-distribution sample detection, and our proposed SRD is superior to both previous distillation and SSL competitors. The source code is available at https://github.com/jingyang2017/SRD_ossl.
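The cross-network logit described above (the student's representation passed through the frozen teacher classifier) can be sketched as below; the KL-matching objective, the temperature, and the assumption that the student representation is already projected to the teacher's feature dimension are illustrative choices rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def srd_loss(student_feat, teacher_feat, teacher_classifier, temperature=4.0):
    """Sketch of semantic representational distillation: use the teacher classifier as a
    semantic critic by scoring the student's representation with it, then match the
    resulting cross-network logits to the teacher's own logits.
    Assumes student_feat has the teacher's feature dimension and the teacher is frozen."""
    with torch.no_grad():
        t_logits = teacher_classifier(teacher_feat)   # teacher scoring its own features
    cross_logits = teacher_classifier(student_feat)   # cross-network logits (gradients flow to the student)
    return F.kl_div(F.log_softmax(cross_logits / temperature, dim=1),
                    F.softmax(t_logits / temperature, dim=1),
                    reduction="batchmean") * temperature ** 2
```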
ESceme: Vision-and-Language Navigation with Episodic Scene Memory
Pub Date: 2024-07-26 | DOI: 10.1007/s11263-024-02159-8
Qi Zheng, Daqing Liu, Chaoyue Wang, Jing Zhang, Dadong Wang, Dacheng Tao
Vision-and-language navigation (VLN) simulates a visual agent that follows natural-language navigation instructions in real-world scenes. Existing approaches have made enormous progress in navigation in new environments, such as beam search, pre-exploration, and dynamic or hierarchical history encoding. To balance generalization and efficiency, we resort to memorizing visited scenarios apart from the ongoing route while navigating. In this work, we introduce a mechanism of Episodic Scene memory (ESceme) for VLN that wakes an agent’s memories of past visits when it enters the current scene. The episodic scene memory allows the agent to envision a bigger picture of the next prediction. This way, the agent learns to utilize dynamically updated information instead of merely adapting to the current observations. We provide a simple yet effective implementation of ESceme by enhancing the accessible views at each location and progressively completing the memory while navigating. We verify the superiority of ESceme on short-horizon (R2R), long-horizon (R4R), and vision-and-dialog (CVDN) VLN tasks. Our ESceme also wins first place on the CVDN leaderboard. Code is available: https://github.com/qizhust/esceme.
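A minimal sketch of an episodic scene memory in the spirit described above follows: per scene, it accumulates the candidate views observed at each visited viewpoint and uses their union to enrich the current observation. The dictionary layout and set-based update are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict

class EpisodicSceneMemory:
    """Remember, for each scene, which candidate views have been observed at every
    visited viewpoint, and enhance the current observation with that memory."""
    def __init__(self):
        self.memory = defaultdict(dict)   # scene_id -> {viewpoint_id: set of view identifiers}

    def update(self, scene_id, viewpoint_id, candidate_views):
        seen = self.memory[scene_id].setdefault(viewpoint_id, set())
        seen.update(candidate_views)      # progressively complete the memory while navigating

    def enhance(self, scene_id, viewpoint_id, current_views):
        seen = self.memory[scene_id].get(viewpoint_id, set())
        return set(current_views) | seen  # a bigger picture than the current observation alone
```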
Deep Unrolled Weighted Graph Laplacian Regularization for Depth Completion
Pub Date: 2024-07-25 | DOI: 10.1007/s11263-024-02188-3
Jin Zeng, Qingpeng Zhu, Tongxuan Tian, Wenxiu Sun, Lin Zhang, Shengjie Zhao
Depth completion aims to estimate dense depth images from sparse depth measurements with RGB image guidance. However, previous approaches have not fully considered sparse-input fidelity, resulting in inconsistency with the sparse input and poor robustness to input corruption. In this paper, we propose deep unrolled Weighted Graph Laplacian Regularization (WGLR) for depth completion, which enhances input fidelity and noise robustness by enforcing input constraints in the network design. Specifically, we assume graph Laplacian regularization as the prior for depth completion optimization and derive the WGLR solution by interpreting the depth map as the discrete counterpart of a continuous manifold, enabling analysis in the continuous domain and enforcing input consistency. Based on its anisotropic diffusion interpretation, we unroll the WGLR solution into iterative filtering for efficient implementation. Furthermore, we integrate the unrolled WGLR into a deep learning framework to develop a high-performance yet interpretable network, which diffuses the depth in a hierarchical manner to ensure global smoothness while preserving visually salient details. Experimental results demonstrate that the proposed scheme improves consistency with depth measurements and robustness to input corruption for depth completion, outperforming competing schemes on the NYUv2, KITTI-DC and TetrasRGBD datasets.
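The diffusion view of graph Laplacian regularization can be illustrated with a short sketch on a 4-neighbor pixel graph: bilateral-style affinities from the RGB guide drive a weighted-average smoothing step, and pixels holding a sparse measurement are reset to it after every iteration to enforce input fidelity. This is a plain iterative filter under assumed hand-set weights, not the paper's learned, unrolled network.

```python
import torch

def wglr_diffusion(sparse_depth, guide, iters=50, sigma=0.1):
    """Iterative graph-Laplacian smoothing sketch.
    sparse_depth: (B, 1, H, W) with zeros where no measurement exists.
    guide: (B, 3, H, W) RGB image used to weight edges between neighboring pixels."""
    mask = (sparse_depth > 0).float()
    depth = sparse_depth.clone()

    def neighbor(x, dy, dx):
        return torch.roll(x, shifts=(dy, dx), dims=(-2, -1))

    for _ in range(iters):
        num = torch.zeros_like(depth)
        den = torch.zeros_like(depth)
        for dy, dx in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            g_n = neighbor(guide, dy, dx)
            w = torch.exp(-((guide - g_n) ** 2).mean(dim=1, keepdim=True) / (2 * sigma ** 2))
            num = num + w * neighbor(depth, dy, dx)
            den = den + w
        depth = num / (den + 1e-8)                        # weighted-average (diffusion) step
        depth = mask * sparse_depth + (1 - mask) * depth  # clamp back to the sparse measurements
    return depth
```

In the unrolled-network reading, the hand-set affinities `w` would instead be predicted by a learned module and the fixed number of iterations would correspond to network layers.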