Warping the Residuals for Image Editing with StyleGAN
Pub Date : 2024-11-18 DOI: 10.1007/s11263-024-02301-6
Ahmet Burak Yildirim, Hamza Pehlivan, Aysegul Dundar
StyleGAN models show editing capabilities via their semantically interpretable latent organizations, which require successful GAN inversion methods to edit real images. Many works have been proposed for inverting images into StyleGAN’s latent space. However, their results either suffer from low fidelity to the input image or poor editing quality, especially for edits that require large transformations. That is because low bit rate latent spaces lose many image details due to the information bottleneck, even though they provide an editable space. On the other hand, higher bit rate latent spaces can pass all the image details to StyleGAN for perfect reconstruction of images but suffer from low editing quality. In this work, we present a novel image inversion architecture that extracts high-rate latent features and includes a flow estimation module to warp these features to adapt them to edits. This is because edits often involve spatial changes in the image, such as adjustments to pose or smile. Thus, high-rate latent features must be accurately repositioned to match their new locations in the edited image space. We achieve this by employing flow estimation to determine the necessary spatial adjustments, followed by warping the features to align them correctly in the edited image. Specifically, we estimate the flows from StyleGAN features of edited and unedited latent codes. By estimating the high-rate features and warping them for edits, we achieve both high fidelity to the input image and high-quality edits. We run extensive experiments and compare our method with state-of-the-art inversion methods. Quantitative metrics and visual comparisons show significant improvements.
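To make the warping step concrete, here is a minimal PyTorch sketch of flow-based feature warping. It is not the authors' implementation: `flow_net`, `gen_feats_unedited`, `gen_feats_edited`, and `high_rate_feats` are hypothetical names for a flow predictor and the features it would operate on.

```python
import torch
import torch.nn.functional as F

def warp_features(feats, flow):
    """Warp feature maps with a dense flow field (per-pixel x/y offsets) via bilinear sampling."""
    b, _, h, w = feats.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().to(feats.device)   # (h, w, 2)
    grid = grid.unsqueeze(0) + flow.permute(0, 2, 3, 1)             # add (b, 2, h, w) flow offsets
    # Normalize to [-1, 1] as required by grid_sample.
    grid[..., 0] = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(feats, grid, mode="bilinear", align_corners=True)

# Hypothetical usage: a flow is predicted from generator features of the unedited and
# edited latent codes, then used to re-position the high-rate features before they are
# fed back to the generator.
# flow = flow_net(gen_feats_unedited, gen_feats_edited)   # (b, 2, h, w)
# aligned = warp_features(high_rate_feats, flow)
```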
{"title":"Warping the Residuals for Image Editing with StyleGAN","authors":"Ahmet Burak Yildirim, Hamza Pehlivan, Aysegul Dundar","doi":"10.1007/s11263-024-02301-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02301-6","url":null,"abstract":"<p>StyleGAN models show editing capabilities via their semantically interpretable latent organizations which require successful GAN inversion methods to edit real images. Many works have been proposed for inverting images into StyleGAN’s latent space. However, their results either suffer from low fidelity to the input image or poor editing qualities, especially for edits that require large transformations. That is because low bit rate latent spaces lose many image details due to the information bottleneck even though it provides an editable space. On the other hand, higher bit rate latent spaces can pass all the image details to StyleGAN for perfect reconstruction of images but suffer from low editing qualities. In this work, we present a novel image inversion architecture that extracts high-rate latent features and includes a flow estimation module to warp these features to adapt them to edits. This is because edits often involve spatial changes in the image, such as adjustments to pose or smile. Thus, high-rate latent features must be accurately repositioned to match their new locations in the edited image space. We achieve this by employing flow estimation to determine the necessary spatial adjustments, followed by warping the features to align them correctly in the edited image. Specifically, we estimate the flows from StyleGAN features of edited and unedited latent codes. By estimating the high-rate features and warping them for edits, we achieve both high-fidelity to the input image and high-quality edits. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142670356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation
Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Liwei Wu, Yuxi Wang, Zhaoxiang Zhang
Pub Date : 2024-11-16 DOI: 10.1007/s11263-024-02285-3
Domain-adaptive semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. However, existing methods primarily focus on directly learning categorically discriminative target features for segmenting target images, which is challenging in the absence of target labels. This work provides a new perspective. We observe that the features learned with source data manage to remain categorically discriminative during training, thereby enabling us to implicitly learn adequate target representations by simply pulling target features close to source features for each category. To this end, we propose T2S-DA, which encourages the model to learn similar cross-domain features. Also, considering that pixel categories are heavily imbalanced in segmentation datasets, we come up with a dynamic re-weighting strategy to help the model concentrate on those underperforming classes. Extensive experiments confirm that T2S-DA learns a more discriminative and generalizable representation, significantly surpassing the state-of-the-art. We further show that T2S-DA is well suited to the domain generalization task, verifying its domain-invariant property.
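The pulling idea can be sketched as a per-class prototype alignment loss. The following is an illustrative PyTorch version, not the T2S-DA code; `class_weights` and `tgt_pseudo` stand in for whatever dynamic re-weighting scheme and pseudo-labels are actually used.

```python
import torch
import torch.nn.functional as F

def pull_target_to_source(src_feats, src_labels, tgt_feats, tgt_pseudo, num_classes, class_weights):
    """Pull target features toward per-class source prototypes (cosine similarity),
    weighting each class by how poorly it currently performs (illustrative)."""
    loss, used = 0.0, 0
    for c in range(num_classes):
        src_c = src_feats[src_labels == c]
        tgt_c = tgt_feats[tgt_pseudo == c]
        if len(src_c) == 0 or len(tgt_c) == 0:
            continue
        proto = F.normalize(src_c.mean(dim=0), dim=0)        # class-c source prototype
        sim = F.normalize(tgt_c, dim=1) @ proto              # cosine similarity to the prototype
        loss = loss + class_weights[c] * (1.0 - sim).mean()  # pull target features closer
        used += 1
    return loss / max(used, 1)
```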
{"title":"Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation","authors":"Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Liwei Wu, Yuxi Wang, Zhaoxiang Zhang","doi":"10.1007/s11263-024-02285-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02285-3","url":null,"abstract":"<p>Domain-adaptive semantic segmentation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. However, existing methods primarily focus on directly learning categorically discriminative target features for segmenting target images, which is challenging in the absence of target labels. This work provides a new perspective. We ob serve that the features learned with source data manage to keep categorically discriminative during training, thereby enabling us to implicitly learn adequate target representations by simply <i>pulling target features close to source features for each category</i>. To this end, we propose T2S-DA, which encourages the model to learn similar cross-domain features. Also, considering the pixel categories are heavily imbalanced for segmentation datasets, we come up with a dynamic re-weighting strategy to help the model concentrate on those underperforming classes. Extensive experiments confirm that T2S-DA learns a more discriminative and generalizable representation, significantly surpassing the state-of-the-art. We further show that T2S-DA is quite qualified for the domain generalization task, verifying its domain-invariant property.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"99 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142642626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature Matching via Graph Clustering with Local Affine Consensus
Pub Date : 2024-11-15 DOI: 10.1007/s11263-024-02291-5
Yifan Lu, Jiayi Ma
This paper studies graph clustering with application to feature matching and proposes an effective method, termed GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges, where a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometrically meaningful graph, based on best practices, we adopt a local affine strategy. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachable on the graph via replicator dynamics optimization. Extensive experiments on both the local components and the full pipeline of GC-LAC, across various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multi-model fitting, demonstrate that our GC-LAC is more competitive than current state-of-the-art methods in terms of generality, efficiency, and effectiveness. The source code for this work is publicly available at: https://github.com/YifanLu2000/GCLAC.
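As a toy illustration of the graph construction only (not the MCDG solver or the D2SCAN clustering from the paper), one can connect spatially neighbouring putative matches with edge weights that reward similar displacement vectors:

```python
import numpy as np

def motion_coherence_graph(pts1, pts2, k=8, sigma=5.0):
    """Build an affinity graph over putative matches: each match is a node, and edges
    connect k spatial neighbours whose displacement vectors agree (illustrative)."""
    disp = pts2 - pts1                                  # motion vector of each match
    n = len(pts1)
    W = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(pts1 - pts1[i], axis=1)      # spatial neighbours in the first image
        nbrs = np.argsort(d)[1:k + 1]
        for j in nbrs:
            coherence = np.exp(-np.linalg.norm(disp[i] - disp[j]) ** 2 / (2 * sigma ** 2))
            W[i, j] = W[j, i] = coherence               # strong edge = similar motion behaviour
    return W
```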
{"title":"Feature Matching via Graph Clustering with Local Affine Consensus","authors":"Yifan Lu, Jiayi Ma","doi":"10.1007/s11263-024-02291-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02291-5","url":null,"abstract":"<p>This paper studies graph clustering with application to feature matching and proposes an effective method, termed as GC-LAC, that can establish reliable feature correspondences and simultaneously discover all potential visual patterns. In particular, we regard each putative match as a node and encode the geometric relationships into edges where a visual pattern sharing similar motion behaviors corresponds to a strongly connected subgraph. In this setting, it is natural to formulate the feature matching task as a graph clustering problem. To construct a geometric meaningful graph, based on the best practices, we adopt a local affine strategy. By investigating the motion coherence prior, we further propose an efficient and deterministic geometric solver (MCDG) to extract the local geometric information that helps construct the graph. The graph is sparse and general for various image transformations. Subsequently, a novel robust graph clustering algorithm (D2SCAN) is introduced, which defines the notion of density-reachable on the graph by replicator dynamics optimization. Extensive experiments focusing on both the local and the whole of our GC-LAC with various practical vision tasks including relative pose estimation, homography and fundamental matrix estimation, loop-closure detection, and multimodel fitting, demonstrate that our GC-LAC is more competitive than current state-of-the-art methods, in terms of generality, efficiency, and effectiveness. The source code for this work is publicly available at: https://github.com/YifanLu2000/GCLAC.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"75 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142637263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CCR: Facial Image Editing with Continuity, Consistency and Reversibility
Nan Yang, Xin Luan, Huidi Jia, Zhi Han, Xiaofeng Li, Yandong Tang
Pub Date : 2023-11-14 DOI: 10.1007/s11263-023-01938-z
Three problems exist in sequential facial image editing: discontinuous editing, inconsistent editing, and irreversible editing. Discontinuous editing means that the current edit cannot retain previously edited attributes. Inconsistent editing means that swapping the order of attribute edits does not yield the same result. Irreversible editing means that an operation on a facial image cannot be undone, especially in sequential facial image editing. In this work, we put forward three concepts and their corresponding definitions: editing continuity, consistency, and reversibility. Note that continuity refers to the continuity of attributes, that is, attributes can be continuously edited on any face. Consistency means that not only do attributes meet continuity, but facial identity also remains consistent. To this end, we propose a novel model to achieve editing continuity, consistency, and reversibility. Furthermore, a sufficient criterion is defined to determine whether a model is continuous, consistent, and reversible. Extensive qualitative and quantitative experimental results validate our proposed model, and show that a continuous, consistent and reversible editing model offers more flexible editing while preserving facial identity. We believe that our proposed definitions and model will have wide and promising applications in multimedia processing. Code and data are available at https://github.com/mickoluan/CCR.
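The consistency and reversibility definitions suggest simple empirical checks. Below is a hedged sketch in which `edit` and `distance` are placeholders for an editing model and an image/identity distance, not interfaces from the released code.

```python
def check_consistency(edit, distance, image, attr_a, attr_b, tol=1e-3):
    """Consistency: editing attribute A then B should match editing B then A (illustrative)."""
    ab = edit(edit(image, attr_a), attr_b)
    ba = edit(edit(image, attr_b), attr_a)
    return distance(ab, ba) < tol

def check_reversibility(edit, distance, image, attr, strength=1.0, tol=1e-3):
    """Reversibility: applying an edit and then its inverse should recover the input (illustrative)."""
    forward = edit(image, attr, strength)
    restored = edit(forward, attr, -strength)
    return distance(restored, image) < tol
```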
{"title":"CCR: Facial Image Editing with Continuity, Consistency and Reversibility","authors":"Nan Yang, Xin Luan, Huidi Jia, Zhi Han, Xiaofeng Li, Yandong Tang","doi":"10.1007/s11263-023-01938-z","DOIUrl":"https://doi.org/10.1007/s11263-023-01938-z","url":null,"abstract":"<p>Three problems exist in sequential facial image editing: discontinuous editing, inconsistent editing, and irreversible editing. Discontinuous editing is that the current editing can not retain the previously edited attributes. Inconsistent editing is that swapping the attribute editing orders can not yield the same results. Irreversible editing means that operating on a facial image is irreversible, especially in sequential facial image editing. In this work, we put forward three concepts and their corresponding definitions: editing continuity, consistency, and reversibility. Note that continuity refers to the continuity of attributes, that is, attributes can be continuously edited on any face. Consistency is that not only attributes meet continuity, but also facial identity needs to be consistent. To do so, we propose a novel model to achieve the goal of editing continuity, consistency, and reversibility. Furthermore, a sufficient criterion is defined to determine whether a model is continuous, consistent, and reversible. Extensive qualitative and quantitative experimental results validate our proposed model, and show that a continuous, consistent and reversible editing model has a more flexible editing function while preserving facial identity. We believe that our proposed definitions and model will have wide and promising applications in multimedia processing. Code and data are available at https://github.com/mickoluan/CCR.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 5","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"92158488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Going Deeper into Recognizing Actions in Dark Environments: A Comprehensive Benchmark Study
Yuecong Xu, Haozhi Cao, Jianxiong Yin, Zhenghua Chen, Xiaoli Li, Zhengguo Li, Qianwen Xu, Jianfei Yang
Pub Date : 2023-11-08 DOI: 10.1007/s11263-023-01932-5
While action recognition (AR) has gained large improvements with the introduction of large-scale video datasets and the development of deep neural networks, AR models robust to challenging environments in real-world scenarios are still under-explored. We focus on the task of action recognition in dark environments, which can be applied to fields such as surveillance and autonomous driving at night. Intuitively, current deep networks along with visual enhancement techniques should be able to handle AR in dark environments; however, it is observed that this is not always the case in practice. To dive deeper into exploring solutions for AR in dark environments, we launched the UG2+ Challenge Track 2 (UG2-2) at IEEE CVPR 2021, with the goal of evaluating and advancing the robustness of AR models in dark environments. The challenge builds and expands on a novel ARID dataset, the first dataset for the task of dark video AR, and guides models to tackle such a task in both fully and semi-supervised manners. Baseline results utilizing current AR models and enhancement methods are reported, justifying the challenging nature of this task with substantial room for improvement. Thanks to the active participation of the research community, notable advances have been made in participants’ solutions, and analysis of these solutions has helped better identify possible directions to tackle the challenge of AR in dark environments.
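As one concrete example of the enhancement-plus-recognition baselines mentioned above, a simple gamma correction can brighten dark frames before recognition; this is a generic sketch, not a method prescribed by the challenge, and `ar_model` is an assumed placeholder.

```python
import numpy as np

def gamma_correct(frames, gamma=0.4):
    """Brighten dark video frames with gamma correction before feeding them to an
    action-recognition backbone (a simple visual-enhancement baseline)."""
    frames = frames.astype(np.float32) / 255.0
    return np.clip(frames ** gamma, 0.0, 1.0)

# Hypothetical usage with a clip of shape (T, H, W, 3), values in [0, 255]:
# clip = gamma_correct(dark_clip)   # enhanced frames in [0, 1]
# logits = ar_model(clip)           # any pretrained AR model
```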
{"title":"Going Deeper into Recognizing Actions in Dark Environments: A Comprehensive Benchmark Study","authors":"Yuecong Xu, Haozhi Cao, Jianxiong Yin, Zhenghua Chen, Xiaoli Li, Zhengguo Li, Qianwen Xu, Jianfei Yang","doi":"10.1007/s11263-023-01932-5","DOIUrl":"https://doi.org/10.1007/s11263-023-01932-5","url":null,"abstract":"<p>While action recognition (AR) has gained large improvements with the introduction of large-scale video datasets and the development of deep neural networks, AR models robust to challenging environments in real-world scenarios are still under-explored. We focus on the task of action recognition in dark environments, which can be applied to fields such as surveillance and autonomous driving at night. Intuitively, current deep networks along with visual enhancement techniques should be able to handle AR in dark environments, however, it is observed that this is not always the case in practice. To dive deeper into exploring solutions for AR in dark environments, we launched the <span>(hbox {UG}^{2}{+})</span> Challenge Track 2 (UG2-2) in IEEE CVPR 2021, with a goal of evaluating and advancing the robustness of AR models in dark environments. The challenge builds and expands on top of a novel ARID dataset, the first dataset for the task of dark video AR, and guides models to tackle such a task in both fully and semi-supervised manners. Baseline results utilizing current AR models and enhancement methods are reported, justifying the challenging nature of this task with substantial room for improvements. Thanks to the active participation from the research community, notable advances have been made in participants’ solutions, while analysis of these solutions helped better identify possible directions to tackle the challenge of AR in dark environments.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"57 17","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71516823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harmonizing Base and Novel Classes: A Class-Contrastive Approach for Generalized Few-Shot Segmentation
Pub Date : 2023-11-08 DOI: 10.1007/s11263-023-01939-y
Weide Liu, Zhonghua Wu, Yang Zhao, Yuming Fang, Chuan-Sheng Foo, Jun Cheng, Guosheng Lin
Current methods for few-shot segmentation (FSSeg) have mainly focused on improving the performance of novel classes while neglecting the performance of base classes. To overcome this limitation, the task of generalized few-shot semantic segmentation (GFSSeg) has been introduced, aiming to predict segmentation masks for both base and novel classes. However, current prototype-based methods do not explicitly consider the relationship between base and novel classes when updating prototypes, leading to limited performance in identifying true categories. To address this challenge, we propose a class contrastive loss and a class relationship loss to regulate prototype updates and encourage a large distance between prototypes from different classes, thus distinguishing the classes from each other while maintaining the performance of the base classes. Our proposed approach achieves new state-of-the-art performance for the generalized few-shot segmentation task on the PASCAL VOC and MS COCO datasets.
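A minimal sketch of a class-contrastive objective that pushes prototypes of different classes apart; the exact losses in the paper may differ, and `margin` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def prototype_separation_loss(prototypes, margin=0.5):
    """Encourage prototypes of different classes (base and novel alike) to stay far apart
    by penalising pairwise cosine similarity above a margin (illustrative)."""
    protos = F.normalize(prototypes, dim=1)                        # (num_classes, dim)
    sim = protos @ protos.t()                                      # pairwise cosine similarities
    off_diag = sim - torch.eye(len(protos), device=sim.device)     # zero out the diagonal
    return F.relu(off_diag - margin).sum() / (len(protos) * (len(protos) - 1))
```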
{"title":"Harmonizing Base and Novel Classes: A Class-Contrastive Approach for Generalized Few-Shot Segmentation","authors":"Weide Liu, Zhonghua Wu, Yang Zhao, Yuming Fang, Chuan-Sheng Foo, Jun Cheng, Guosheng Lin","doi":"10.1007/s11263-023-01939-y","DOIUrl":"https://doi.org/10.1007/s11263-023-01939-y","url":null,"abstract":"<p>Current methods for few-shot segmentation (FSSeg) have mainly focused on improving the performance of novel classes while neglecting the performance of base classes. To overcome this limitation, the task of generalized few-shot semantic segmentation (GFSSeg) has been introduced, aiming to predict segmentation masks for both base and novel classes. However, the current prototype-based methods do not explicitly consider the relationship between base and novel classes when updating prototypes, leading to a limited performance in identifying true categories. To address this challenge, we propose a class contrastive loss and a class relationship loss to regulate prototype updates and encourage a large distance between prototypes from different classes, thus distinguishing the classes from each other while maintaining the performance of the base classes. Our proposed approach achieves new state-of-the-art performance for the generalized few-shot segmentation task on PASCAL VOC and MS COCO datasets.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"35 21","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71524145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Universal Object Detection with Large Vision Model
Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu, Xiaoyu Wang
Pub Date : 2023-11-07 DOI: 10.1007/s11263-023-01929-0
Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a second-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.
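One common way to handle hierarchical taxonomies is to propagate positive labels to their ancestors before computing the loss. The sketch below is illustrative only and uses a tiny hypothetical `parent` map, not the RVC 2022 label space.

```python
def expand_to_ancestors(labels, parent):
    """Propagate positive labels up a class hierarchy: if 'sedan' is annotated,
    its ancestors ('car', 'vehicle') are treated as positives too (illustrative)."""
    expanded = set(labels)
    for label in labels:
        node = label
        while node in parent:          # walk up until the taxonomy root
            node = parent[node]
            expanded.add(node)
    return expanded

# Example with a tiny hypothetical taxonomy:
# parent = {"sedan": "car", "car": "vehicle"}
# expand_to_ancestors({"sedan"}, parent)  ->  {"sedan", "car", "vehicle"}
```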
{"title":"Universal Object Detection with Large Vision Model","authors":"Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu, Xiaoyu Wang","doi":"10.1007/s11263-023-01929-0","DOIUrl":"https://doi.org/10.1007/s11263-023-01929-0","url":null,"abstract":"<p>Over the past few years, there has been growing interest in developing a broad, universal, and general-purpose computer vision system. Such systems have the potential to address a wide range of vision tasks simultaneously, without being limited to specific problems or data domains. This universality is crucial for practical, real-world computer vision applications. In this study, our focus is on a specific challenge: the large-scale, multi-domain universal object detection problem, which contributes to the broader goal of achieving a universal vision system. This problem presents several intricate challenges, including cross-dataset category label duplication, label conflicts, and the necessity to handle hierarchical taxonomies. To address these challenges, we introduce our approach to label handling, hierarchy-aware loss design, and resource-efficient model training utilizing a pre-trained large vision model. Our method has demonstrated remarkable performance, securing a prestigious <i>second</i>-place ranking in the object detection track of the Robust Vision Challenge 2022 (RVC 2022) on a million-scale cross-dataset object detection benchmark. We believe that our comprehensive study will serve as a valuable reference and offer an alternative approach for addressing similar challenges within the computer vision community. The source code for our work is openly available at https://github.com/linfeng93/Large-UniDet.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 33","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose
Pub Date : 2023-11-06 DOI: 10.1007/s11263-023-01935-2
Yaokun Li, Guang Tan, Chao Gou
Landmark detection under large poses with occlusion has long been one of the challenging problems in facial analysis. Recently, many works have predicted pose or occlusion together in the multi-task learning (MTL) paradigm, trying to tap into their dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., accommodating the induced explicit effects. This is one of the key factors that hinders their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmarks, occlusion probability, and pose. Besides implicitly mining task dependencies in a shared encoder, the proposed CIT innovatively employs a cost-effective and portability-friendly strategy that passes the decoders’ predictions as prior knowledge, so that the coupling-induced effects can be exploited in a human-like way. Moreover, to the best of our knowledge, no dataset contains all these task annotations simultaneously, so we introduce a new dataset termed MERL-RAV-FLOP based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results. The code and dataset can be accessed at https://github.com/Iron-LYK/CIT.
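To illustrate how decoder predictions can be fed back as prior knowledge across cascade stages, here is a schematic PyTorch stage; dimensions, heads, and fusion layers are chosen for illustration rather than taken from the CIT architecture.

```python
import torch
import torch.nn as nn

class CascadeStage(nn.Module):
    """One refinement stage: previous landmark/occlusion/pose estimates are embedded and
    fused with image features before re-prediction (illustrative, not the paper's blocks)."""
    def __init__(self, feat_dim=256, num_points=68):
        super().__init__()
        prior_dim = num_points * 3 + 3                  # (x, y, occlusion prob) per point + 3 pose angles
        self.prior_embed = nn.Linear(prior_dim, feat_dim)
        self.fuse = nn.Linear(feat_dim * 2, feat_dim)
        self.landmark_head = nn.Linear(feat_dim, num_points * 2)
        self.occlusion_head = nn.Linear(feat_dim, num_points)
        self.pose_head = nn.Linear(feat_dim, 3)

    def forward(self, img_feat, landmarks, occlusion, pose):
        prior = torch.cat([landmarks, occlusion, pose], dim=-1)            # previous predictions as prior
        h = torch.relu(self.fuse(torch.cat([img_feat, self.prior_embed(prior)], dim=-1)))
        return self.landmark_head(h), torch.sigmoid(self.occlusion_head(h)), self.pose_head(h)

# Iterative refinement: each stage consumes the previous stage's outputs as prior.
# for stage in stages:
#     landmarks, occlusion, pose = stage(img_feat, landmarks, occlusion, pose)
```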
{"title":"Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose","authors":"Yaokun Li, Guang Tan, Chao Gou","doi":"10.1007/s11263-023-01935-2","DOIUrl":"https://doi.org/10.1007/s11263-023-01935-2","url":null,"abstract":"<p>Landmark detection under large pose with occlusion has been one of the challenging problems in the field of facial analysis. Recently, many works have predicted pose or occlusion together in the multi-task learning (MTL) paradigm, trying to tap into their dependencies and thus alleviate this issue. However, such implicit dependencies are weakly interpretable and inconsistent with the way humans exploit inter-task coupling relations, i.e., accommodating the induced explicit effects. This is one of the essentials that hinders their performance. To this end, in this paper, we propose a Cascaded Iterative Transformer (CIT) to jointly predict facial landmark, occlusion probability, and pose. The proposed CIT, besides implicitly mining task dependencies in a shared encoder, innovatively employs a cost-effective and portability-friendly strategy to pass the decoders’ predictions as prior knowledge to human-like exploit the coupling-induced effects. Moreover, to the best of our knowledge, no dataset contains all these task annotations simultaneously, so we introduce a new dataset termed MERL-RAV-FLOP based on the MERL-RAV dataset. We conduct extensive experiments on several challenging datasets (300W-LP, AFLW2000-3D, BIWI, COFW, and MERL-RAV-FLOP) and achieve remarkable results. The code and dataset can be accessed in https://github.com/Iron-LYK/CIT.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"57 16","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71516824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local Compressed Video Stream Learning for Generic Event Boundary Detection
Pub Date : 2023-11-01 DOI: 10.1007/s11263-023-01921-8
Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before being fed into the network, which contains significant spatio-temporal redundancy and demands considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end, leveraging rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct a local frame bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is only applied within a local window, which is critical for event boundary detection. Finally, a simple classifier is used to determine the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguity of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements compared to the previous end-to-end approach while running at the same speed. The code is available at https://github.com/GX77/LCVSL.
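The Gaussian preprocessing of ground-truth boundaries can be sketched as soft per-frame targets; this is an assumed formulation for illustration, with `sigma` as an illustrative kernel width rather than the paper's setting.

```python
import numpy as np

def soften_boundaries(boundary_frames, num_frames, sigma=1.0):
    """Turn hard event-boundary annotations into soft per-frame targets by placing a
    Gaussian around each annotated boundary frame (illustrative preprocessing)."""
    t = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundary_frames:
        target = np.maximum(target, np.exp(-0.5 * ((t - b) / sigma) ** 2))
    return target

# soften_boundaries([10, 42], num_frames=64) yields targets near 1 around frames 10 and 42,
# decaying smoothly to 0, which tolerates slight annotation ambiguity.
```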
{"title":"Local Compressed Video Stream Learning for Generic Event Boundary Detection","authors":"Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan","doi":"10.1007/s11263-023-01921-8","DOIUrl":"https://doi.org/10.1007/s11263-023-01921-8","url":null,"abstract":"<p>Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before feeding into the network, which contains significant spatio-temporal redundancy and demands considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end leveraging rich information in the compressed domain, <i>i.e.</i>, RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs and spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct the local frames bag for each candidate frame and use the long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is only applied within a local window, which is critical for event boundary detection. Finally a simple classifier is used to determine the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguities of annotations and speed up the training process, we use the Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements compared to previous end-to-end approach while running at the same speed. The code is available at https://github.com/GX77/LCVSL.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 34","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}