UTM: A Unified Multiple Object Tracking Model with Identity-Aware Feature Enhancement
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.02095
Sisi You, Hantao Yao, Bingkun Bao, Changsheng Xu
Multiple Object Tracking, which consists of object detection, feature embedding, and identity association, has recently achieved great success. Existing methods apply a three-step or two-step paradigm to generate robust trajectories, where identity association is independent of the other components. However, this independent identity association means that the identity-aware knowledge contained in the tracklets is not used to boost the detection and embedding modules. To overcome the limitations of existing methods, we introduce a novel Unified Tracking Model (UTM) that bridges these three components to form a positive feedback loop with mutual benefits. The key insight of UTM is the Identity-Aware Feature Enhancement (IAFE), which bridges and benefits these three components by utilizing the identity-aware knowledge to boost detection and embedding. Formally, IAFE contains the Identity-Aware Boosting Attention (IABA) and the Identity-Aware Erasing Attention (IAEA), where IABA enhances the regions of the current frame feature that are consistent with the identity-aware knowledge, and IAEA suppresses distracting regions in the current frame feature. With better detections and embeddings, higher-quality tracklets can in turn be generated. Extensive experiments with public and private detections on three benchmarks demonstrate the robustness of UTM.
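As a rough illustration only (not the authors' implementation), the sketch below boosts spatial locations of the current-frame feature that agree with a pooled identity-aware feature and erases locations that resemble distracting identities; every name, shape, and threshold here is an assumption.

```python
import numpy as np

def identity_aware_enhancement(frame_feat, id_feat, distractor_feats, erase_thresh=0.6):
    """Hypothetical sketch of IABA-style boosting and IAEA-style erasing.

    frame_feat:       (C, H, W) feature map of the current frame
    id_feat:          (C,)      identity-aware knowledge pooled from the tracklet
    distractor_feats: (K, C)    features of other (distracting) identities
    """
    C, H, W = frame_feat.shape
    f = frame_feat.reshape(C, -1)                                   # (C, H*W)
    f_norm = np.linalg.norm(f, axis=0) + 1e-8

    # IABA-like term: enhance regions consistent with the identity-aware knowledge
    sim_id = (id_feat @ f) / (np.linalg.norm(id_feat) * f_norm)
    boost = 1.0 + np.clip(sim_id, 0.0, None)                        # (H*W,)

    # IAEA-like term: suppress regions that look like distracting identities
    sim_dis = (distractor_feats @ f) / (
        np.linalg.norm(distractor_feats, axis=1, keepdims=True) * f_norm)
    erase = np.where(sim_dis.max(axis=0) > erase_thresh, 0.0, 1.0)  # (H*W,)

    weight = (boost * erase).reshape(1, H, W)
    return frame_feat * weight
```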
{"title":"UTM: A Unified Multiple Object Tracking Model with Identity-Aware Feature Enhancement","authors":"Sisi You, Hantao Yao, Bingkun Bao, Changsheng Xu","doi":"10.1109/CVPR52729.2023.02095","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.02095","url":null,"abstract":"Recently, Multiple Object Tracking has achieved great success, which consists of object detection, feature embedding, and identity association. Existing methods apply the three-step or two-step paradigm to generate robust trajectories, where identity association is independent of other components. However, the independent identity association results in the identity-aware knowledge contained in the tracklet not be used to boost the detection and embedding modules. To overcome the limitations of existing methods, we introduce a novel Unified Tracking Model (UTM) to bridge those three components for generating a positive feedback loop with mutual benefits. The key insight of UTM is the Identity-Aware Feature Enhancement (IAFE), which is applied to bridge and benefit these three components by utilizing the identity-aware knowledge to boost detection and embedding. Formally, IAFE contains the Identity-Aware Boosting Attention (IABA) and the Identity-Aware Erasing Attention (IAEA), where IABA enhances the consistent regions between the current frame feature and identity-aware knowledge, and IAEA suppresses the distracted regions in the current frame feature. With better detections and embeddings, higher-quality tracklets can also be generated. Extensive experiments of public and private detections on three benchmarks demonstrate the robustness of UTM.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116703548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Single Image Reflection Removal Against Adversarial Attacks
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.02365
Zhenbo Song, Zhenyuan Zhang, Kaihao Zhang, Wenhan Luo, Jason Zhaoxin Fan, Wenqi Ren, Jianfeng Lu
This paper addresses the problem of robust deep single-image reflection removal (SIRR) against adversarial attacks. Current deep-learning-based SIRR methods show significant performance degradation under unnoticeable distortions and perturbations of the input images. For a comprehensive robustness study, we first conduct diverse adversarial attacks designed specifically for the SIRR problem, i.e., targeting different attack objectives and regions. We then propose a robust SIRR model that integrates a cross-scale attention module, a multi-scale fusion module, and an adversarial image discriminator. By exploiting the multi-scale mechanism, the model narrows the gap between features from clean and adversarial images. The image discriminator adaptively distinguishes clean from noisy inputs, further improving robustness. Extensive experiments on the Nature, SIR2, and Real datasets demonstrate that our model remarkably improves the robustness of SIRR across disparate scenes.
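For context, a region-restricted PGD-style attack of the kind such a robustness study might use can be sketched as follows; the exact attack targets, losses, and budgets used in the paper may differ, and `model`, the mask, and the hyper-parameters are placeholders.

```python
import torch

def region_pgd_attack(model, image, target, region_mask, eps=8/255, alpha=2/255, steps=10):
    """Sketch of a PGD-style attack on a reflection-removal model, restricted to a region.

    model:       maps an input image to a predicted transmission (reflection-free) layer
    image:       (1, 3, H, W) clean input in [0, 1]
    target:      (1, 3, H, W) reference clean output the attack tries to push away from
    region_mask: (1, 1, H, W) binary mask selecting where perturbations are allowed
    """
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.nn.functional.l1_loss(model(adv), target)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign() * region_mask      # ascend: degrade the output
            adv = image + (adv - image).clamp(-eps, eps)       # project back to the eps-ball
            adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```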
{"title":"Robust Single Image Reflection Removal Against Adversarial Attacks","authors":"Zhenbo Song, Zhenyuan Zhang, Kaihao Zhang, Wenhan Luo, Jason Zhaoxin Fan, Wenqi Ren, Jianfeng Lu","doi":"10.1109/CVPR52729.2023.02365","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.02365","url":null,"abstract":"This paper addresses the problem of robust deep single-image reflection removal (SIRR) against adversarial attacks. Current deep learning based SIRR methods have shown significant performance degradation due to unnoticeable distortions and perturbations on input images. For a comprehensive robustness study, we first conduct diverse adversarial attacks specifically for the SIRR problem, i.e. towards different attacking targets and regions. Then we propose a robust SIRR model, which integrates the cross-scale attention module, the multi-scale fusion module, and the adversarial image discriminator. By exploiting the multi-scale mechanism, the model narrows the gap between features from clean and adversarial images. The image discriminator adaptively distinguishes clean or noisy inputs, and thus further gains reliable robustness. Extensive experiments on Nature, SIR2, and Real datasets demonstrate that our model remarkably improves the robustness of SIRR across disparate scenes.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"79 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120892716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiclass Confidence and Localization Calibration for Object Detection
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.01890
Bimsara Pathiraja, Malitha Gunawardhana, M. H. Khan
Although deep neural networks (DNNs) achieve high predictive accuracy across many challenging computer vision problems, recent studies suggest that they tend to make over-confident predictions, rendering them poorly calibrated. Most existing attempts at improving DNN calibration are limited to classification tasks and restricted to calibrating in-domain predictions. Surprisingly, few to no attempts have been made to study the calibration of object detection methods, which occupy a pivotal place in vision-based security-sensitive and safety-critical applications. In this paper, we propose a new train-time technique for calibrating modern object detection methods. It is capable of jointly calibrating multiclass confidence and box localization by leveraging their predictive uncertainties. We perform extensive experiments on several in-domain and out-of-domain detection benchmarks. The results demonstrate that our proposed train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL
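The abstract does not give the exact loss, so purely as a hedged illustration of train-time joint calibration, a simple auxiliary term that ties class confidence to localization quality could look like the following (names and formulation are assumptions, not the paper's method).

```python
import torch

def train_time_calibration_loss(class_logits, pred_ious):
    """Hypothetical auxiliary loss aligning confidence with localization quality.

    class_logits: (N, K) per-detection class logits
    pred_ious:    (N,)   IoU of each predicted box with its matched ground truth

    Generic surrogate for joint confidence/localization calibration,
    added to the usual detection losses during training.
    """
    confidence = torch.softmax(class_logits, dim=1).max(dim=1).values  # (N,)
    return torch.mean((confidence - pred_ious) ** 2)
```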
{"title":"Multiclass Confidence and Localization Calibration for Object Detection","authors":"Bimsara Pathiraja, Malitha Gunawardhana, M. H. Khan","doi":"10.1109/CVPR52729.2023.01890","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.01890","url":null,"abstract":"Albeit achieving high predictive accuracy across many challenging computer vision problems, recent studies suggest that deep neural networks (DNNs) tend to make over-confident predictions, rendering them poorly calibrated. Most of the existing attempts for improving DNN calibration are limited to classification tasks and restricted to calibrating in-domain predictions. Surprisingly, very little to no attempts have been made in studying the calibration of object detection methods, which occupy a pivotal space in vision-based security-sensitive, and safety-critical applications. In this paper, we propose a new train-time technique for calibrating modern object detection methods. It is capable of jointly calibrating multiclass confidence and box localization by leveraging their predictive uncertainties. We perform extensive experiments on several in-domain and out-of-domain detection benchmarks. Results demonstrate that our proposed train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"59 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120921031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.01802
Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Regeneration, where the goal is not to reconstruct the exact reference clean signal but to improve certain aspects of speech while not necessarily preserving the rest, such as the voice. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual regeneration tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. There, too, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE/.
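The two-step structure (P-AVSR producing discrete units that a P-TTS stage resynthesizes) can be sketched with toy stand-ins as below; the real system builds on large self-supervised audio-visual models and a unit-based vocoder, so every module and dimension here is only a placeholder.

```python
import torch

class PseudoAVSR(torch.nn.Module):
    """Stand-in for P-AVSR: audio-visual features -> sequence of discrete unit ids."""
    def __init__(self, feat_dim=512, num_units=200):
        super().__init__()
        self.proj = torch.nn.Linear(feat_dim, num_units)

    def forward(self, av_feats):                 # (B, T, feat_dim)
        return self.proj(av_feats).argmax(-1)    # (B, T) unit ids


class PseudoTTS(torch.nn.Module):
    """Stand-in for P-TTS: discrete unit ids -> waveform-like output."""
    def __init__(self, num_units=200, hidden=256):
        super().__init__()
        self.embed = torch.nn.Embedding(num_units, hidden)
        self.decoder = torch.nn.GRU(hidden, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, units):                    # (B, T)
        h, _ = self.decoder(self.embed(units))
        return self.head(h).squeeze(-1)          # (B, T) resynthesized signal


def regenerate(av_feats, p_avsr, p_tts):
    """Two-step resynthesis: noisy audio-visual features -> units -> regenerated speech."""
    units = p_avsr(av_feats)
    return p_tts(units)
```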
{"title":"ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration","authors":"Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi","doi":"10.1109/CVPR52729.2023.01802","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.01802","url":null,"abstract":"Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Regeneration, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech while not necessarily preserving the rest such as voice. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model. Moreover, we utilize self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual regeneration tasks with a single model. To demonstrates its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE/.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121030497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ASPnet: Action Segmentation with Shared-Private Representation of Multiple Data Sources
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.00236
Beatrice van Amsterdam, A. Kadkhodamohammadi, Imanol Luengo, D. Stoyanov
Most state-of-the-art methods for action segmentation are based on single input modalities or naïve fusion of multiple data sources. However, effective fusion of complementary information can potentially strengthen segmentation models, making them more robust to sensor noise and more accurate with smaller training datasets. To improve multimodal representation learning for action segmentation, we propose to disentangle the hidden features of a multi-stream segmentation model into modality-shared components, containing common information across data sources, and private components; we then use an attention bottleneck to capture long-range temporal dependencies in the data while preserving disentanglement in consecutive processing layers. Evaluation on the 50Salads, Breakfast and RARP45 datasets shows that our multimodal approach outperforms various data fusion baselines on both multiview and multimodal data sources, obtaining competitive or better results compared with the state of the art. Our model is also more robust to additive sensor noise and can achieve performance on par with strong video baselines even with less training data.
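A minimal sketch of the shared-private idea, assuming two data sources and simple encoder callables; the alignment and orthogonality terms below are generic surrogates rather than the paper's exact losses, and the attention bottleneck is omitted.

```python
import torch

def shared_private_split(feats_a, feats_b, enc_shared, enc_private_a, enc_private_b):
    """Sketch of shared-private disentanglement for two data sources.

    feats_a, feats_b: (B, T, D) per-frame features from two modalities/views.
    The encoders are any callables mapping (B, T, D) -> (B, T, H).
    Returns a fused representation plus two auxiliary disentanglement losses.
    """
    shared_a, shared_b = enc_shared(feats_a), enc_shared(feats_b)
    private_a, private_b = enc_private_a(feats_a), enc_private_b(feats_b)

    # shared components of both sources should agree with each other ...
    align_loss = torch.nn.functional.mse_loss(shared_a, shared_b)
    # ... while each private component should be orthogonal to its shared one
    ortho_loss = (shared_a * private_a).sum(-1).pow(2).mean() \
               + (shared_b * private_b).sum(-1).pow(2).mean()

    fused = torch.cat([shared_a + shared_b, private_a, private_b], dim=-1)
    return fused, align_loss, ortho_loss
```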
{"title":"ASPnet: Action Segmentation with Shared-Private Representation of Multiple Data Sources","authors":"Beatrice van Amsterdam, A. Kadkhodamohammadi, Imanol Luengo, D. Stoyanov","doi":"10.1109/CVPR52729.2023.00236","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.00236","url":null,"abstract":"Most state-of-the-art methods for action segmentation are based on single input modalities or naïve fusion of multiple data sources. However, effective fusion of complementary information can potentially strengthen segmentation models and make them more robust to sensor noise and more accurate with smaller training datasets. In order to improve multimodal representation learning for action segmentation, we propose to disentangle hidden features of a multi-stream segmentation model into modality-shared components, containing common information across data sources, and private components; we then use an attention bottleneck to capture long-range temporal dependencies in the data while preserving disentanglement in consecutive processing layers. Evaluation on 50salads, Breakfast and RARP45 datasets shows that our multimodal approach outperforms different data fusion baselines on both multiview and multimodal data sources, obtaining competitive or better results compared with the state-of-the-art. Our model is also more robust to additive sensor noise and can achieve performance on par with strong video baselines even with less training data.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121093587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation with Implicit Neural Representations
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.00698
R. Gong, Qin Wang, Martin Danelljan, Dengxin Dai, L. Gool
Unsupervised domain adaptation (UDA) for semantic segmentation aims at improving model performance on an unlabeled target domain by leveraging a labeled source domain. Existing approaches have achieved impressive progress by utilizing pseudo-labels on the unlabeled target-domain images. Yet the low-quality pseudo-labels arising from the domain discrepancy inevitably hinder the adaptation. This calls for effective and accurate approaches to estimating the reliability of the pseudo-labels in order to rectify them. In this paper, we propose to estimate the rectification values of the predicted pseudo-labels with implicit neural representations. We view the rectification value as a signal defined over the continuous spatial domain: taking an image coordinate and the nearby deep features as inputs, the rectification value at that coordinate is predicted as the output. This allows us to achieve high-resolution and detailed rectification value estimation, which is particularly important for accurate pseudo-label generation at mask boundaries. The rectified pseudo-labels are then leveraged in our rectification-aware mixture model (RMM), which is learned end-to-end and aids the adaptation. We demonstrate the effectiveness of our approach on different UDA benchmarks, including synthetic-to-real and day-to-night, achieving superior results compared to the state of the art. The implementation is available at https://github.com/ETHRuiGong/IR2F.
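The coordinate-conditioned prediction described above can be sketched as a small implicit MLP that bilinearly samples deep features at continuous coordinates; the sizes and layer choices below are illustrative assumptions, not the released architecture.

```python
import torch

class RectificationINR(torch.nn.Module):
    """Sketch of an implicit representation of pseudo-label rectification values.

    Given a continuous image coordinate and a deep feature sampled near it,
    an MLP predicts a per-class rectification value at that coordinate.
    """
    def __init__(self, feat_dim=256, num_classes=19, hidden=256):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(feat_dim + 2, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, num_classes))

    def forward(self, feat_map, coords):
        """feat_map: (B, C, H, W); coords: (B, N, 2) in [-1, 1], (x, y) order."""
        # bilinearly sample the deep feature at each continuous coordinate
        sampled = torch.nn.functional.grid_sample(
            feat_map, coords.unsqueeze(2), align_corners=False)   # (B, C, N, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)             # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))     # (B, N, num_classes)
```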
{"title":"Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation with Implicit Neural Representations","authors":"R. Gong, Qin Wang, Martin Danelljan, Dengxin Dai, L. Gool","doi":"10.1109/CVPR52729.2023.00698","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.00698","url":null,"abstract":"Unsupervised domain adaptation (UDA) for semantic segmentation aims at improving the model performance on the unlabeled target domain by leveraging a labeled source domain. Existing approaches have achieved impressive progress by utilizing pseudo-labels on the unlabeled target-domain images. Yet the low-quality pseudo-labels, arising from the domain discrepancy, inevitably hinder the adaptation. This calls for effective and accurate approaches to estimating the reliability of the pseudo-labels, in order to rectify them. In this paper, we propose to estimate the rectification values of the predicted pseudo-labels with implicit neural representations. We view the rectification value as a signal defined over the continuous spatial domain. Taking an image coordinate and the nearby deep features as inputs, the rectification value at a given coordinate is predicted as an output. This allows us to achieve high-resolution and detailed rectification values estimation, important for accurate pseudo-label generation at mask boundaries in particular. The rectified pseudo-labels are then leveraged in our rectification-aware mixture model (RMM) to be learned end-to-end and help the adaptation. We demonstrate the effectiveness of our approach on different UDA benchmarks, including synthetic-to-real and day-to-night. Our approach achieves superior results compared to state-of-the-art. The implementation is available at https://github.com/ETHRuiGong/IR2F.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121222922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-Shot Referring Relationships in Videos
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.00227
Yogesh Kumar, Anand Mishra
Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as a <subject, predicate, object> triplet and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that large-scale annotated training examples are available. However, annotating every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects connected via an unseen predicate using only a few support-set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos, for the first time. To this end, we pose the problem as the minimization of an objective function defined over a T-partite random field, where the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embeddings as inputs and is meta-trained on support-set videos in an episodic manner. The objective function is then minimized using belief-propagation-based message passing on the random field to obtain the spatiotemporal localization, i.e., the subject and object trajectories. We perform extensive experiments on two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.
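The abstract describes per-frame candidate boxes scored by unary and pairwise potentials and combined via message passing; for a chain of frames, the max-product special case reduces to a Viterbi-style dynamic program like the sketch below, a simplification of the paper's T-partite formulation with hypothetical inputs.

```python
import numpy as np

def best_track(unary, pairwise):
    """Viterbi-style max-product sketch over a chain of frames.

    unary:    (T, M)      score of each of M candidate boxes in each of T frames
    pairwise: (T-1, M, M) transition score between boxes of consecutive frames
    Returns the index of the selected box in every frame.
    """
    T, M = unary.shape
    score = unary[0].copy()
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise[t - 1]        # (M_prev, M_cur)
        back[t] = cand.argmax(axis=0)                  # best predecessor per current box
        score = cand.max(axis=0) + unary[t]
    track = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                      # backtrack the best path
        track.append(int(back[t][track[-1]]))
    return track[::-1]
```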
{"title":"Few-Shot Referring Relationships in Videos","authors":"Yogesh Kumar, Anand Mishra","doi":"10.1109/CVPR52729.2023.00227","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.00227","url":null,"abstract":"Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos for the first time. To this end, we pose the problem as a minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame-level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embedding as inputs and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using a belief propagation-based message passing on the random field to obtain the spatiotemporal localization or subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121332386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weak-shot Object Detection through Mutual Knowledge Transfer
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.01884
Xuanyi Du, Weitao Wan, Chong Sun, Chen Li
Weak-shot object detection methods exploit a fully annotated source dataset to facilitate detection performance on a target dataset that contains only image-level labels for novel categories. To bridge the gap between these two datasets, we aim to transfer object knowledge between the source (S) and target (T) datasets in a bi-directional manner. We propose a novel Knowledge Transfer (KT) loss which simultaneously distills the knowledge of objectness and class entropy from a proposal generator trained on the S dataset to optimize a multiple instance learning module on the T dataset. By jointly optimizing the classification loss and the proposed KT loss, the multiple instance learning module effectively learns to classify object proposals into novel categories in the T dataset with the knowledge transferred from base categories in the S dataset. Noticing that the predicted boxes on the T dataset can in return be regarded as an extension of the original annotations on the S dataset for refining the proposal generator, we further propose a novel Consistency Filtering (CF) method to reliably remove inaccurate pseudo labels by evaluating the stability of the multiple instance learning module under noise injections. By mutually transferring knowledge between the S and T datasets in an iterative manner, the detection performance on the target dataset is significantly improved. Extensive experiments on public benchmarks validate that the proposed method performs favourably against state-of-the-art methods without increasing the model parameters or the inference computational complexity.
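The Consistency Filtering step, which keeps only pseudo labels whose predictions stay stable under noise injections, could be realized roughly as below; the noise model, threshold, and interfaces are assumptions for illustration rather than the paper's exact procedure.

```python
import torch

def consistency_filter(mil_head, roi_feats, keep_std=0.05, trials=5, noise=0.1):
    """Sketch of stability-based pseudo-label filtering under noise injections.

    mil_head:  callable mapping (N, D) proposal features to (N, K) class probabilities
    roi_feats: (N, D) features of proposals currently carrying pseudo labels
    Keeps a proposal only if its top-class probability is stable across noisy trials.
    """
    with torch.no_grad():
        probs = torch.stack([
            mil_head(roi_feats + noise * torch.randn_like(roi_feats))
            for _ in range(trials)])                       # (trials, N, K)
        top = probs.max(dim=-1).values                     # (trials, N)
        keep = top.std(dim=0) < keep_std                   # (N,) boolean mask
    return keep
```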
{"title":"Weak-shot Object Detection through Mutual Knowledge Transfer","authors":"Xuanyi Du, Weitao Wan, Chong Sun, Chen Li","doi":"10.1109/CVPR52729.2023.01884","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.01884","url":null,"abstract":"Weak-shot Object Detection methods exploit a fully-annotated source dataset to facilitate the detection performance on the target dataset which only contains image-level labels for novel categories. To bridge the gap between these two datasets, we aim to transfer the object knowledge between the source (S) and target (T) datasets in a bi-directional manner. We propose a novel Knowledge Transfer (KT) loss which simultaneously distills the knowledge of objectness and class entropy from a proposal generator trained on the S dataset to optimize a multiple instance learning module on the T dataset. By jointly optimizing the classification loss and the proposed KT loss, the multiple instance learning module effectively learns to classify object proposals into novel categories in the T dataset with the transferred knowledge from base categories in the S dataset. Noticing the predicted boxes on the T dataset can be regarded as an extension for the original annotations on the S dataset to refine the proposal generator in return, we further propose a novel Consistency Filtering (CF) method to reliably remove inaccurate pseudo labels by evaluating the stability of the multiple instance learning module upon noise injections. Via mutually transferring knowledge between the S and T datasets in an iterative manner, the detection performance on the target dataset is significantly improved. Extensive experiments on public benchmarks validate that the proposed method performs favourably against the state-of-the-art methods without increasing the model parameters or inference computational complexity.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127131395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DIP: Dual Incongruity Perceiving Network for Sarcasm Detection
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.00250
C. Wen, Guoli Jia, Jufeng Yang
Sarcasm indicates that the literal meaning is contrary to the real attitude. Considering the popularity and complementarity of image-text data, we investigate the task of multi-modal sarcasm detection. Different from other multi-modal tasks, sarcastic data exhibit an intrinsic incongruity between the paired image and text, as demonstrated in psychological theories. To tackle this issue, we propose a Dual Incongruity Perceiving (DIP) network consisting of two branches that mine sarcastic information at the factual and affective levels. For the factual aspect, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings, and leverage a Gaussian distribution to model the uncertain correlation caused by the incongruity. The distribution is generated from the latest data stored in a memory bank, which can adaptively model the difference in semantic similarity between sarcastic and non-sarcastic data. For the affective aspect, we utilize Siamese layers with shared parameters to learn cross-modal sentiment information. Furthermore, we use the polarity value to construct a relation graph for the mini-batch, which yields a continuous contrastive loss for acquiring affective embeddings. Extensive experiments demonstrate that our proposed method performs favorably against state-of-the-art approaches. Our code is released at https://github.com/downdric/MSD.
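As a hedged sketch of the factual branch's Gaussian modelling, one could score how unusual an image-text similarity is relative to the memory bank as follows; the exact statistic used in the paper is not specified in the abstract, so this is only one plausible instantiation with hypothetical names.

```python
from math import erf, sqrt
import numpy as np

def incongruity_score(sim, memory_sims):
    """Score how incongruent an image-text pair is, given memory-bank statistics.

    sim:         cosine similarity between the image and text of one sample
    memory_sims: iterable of similarities of recent samples in the memory bank
    Returns P(X > sim) under a Gaussian fitted to the bank: a high value means
    the pair is unusually dissimilar, i.e. potentially incongruent/sarcastic.
    """
    mu, sigma = np.mean(memory_sims), np.std(memory_sims) + 1e-8
    z = (sim - mu) / sigma
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))   # 1 - standard normal CDF
```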
{"title":"DIP: Dual Incongruity Perceiving Network for Sarcasm Detection","authors":"C. Wen, Guoli Jia, Jufeng Yang","doi":"10.1109/CVPR52729.2023.00250","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.00250","url":null,"abstract":"Sarcasm indicates the literal meaning is contrary to the real attitude. Considering the popularity and complementarity of image-text data, we investigate the task of multi-modal sarcasm detection. Different from other multi-modal tasks, for the sarcastic data, there exists intrinsic incongruity between a pair of image and text as demonstrated in psychological theories. To tackle this issue, we propose a Dual Incongruity Perceiving (DIP) network consisting of two branches to mine the sarcastic information from factual and affective levels. For the factual aspect, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings, and leverage gaussian distribution to model the uncertain correlation caused by the incongruity. The distribution is generated from the latest data stored in the memory bank, which can adaptively model the difference of semantic similarity between sarcastic and non-sarcastic data. For the affective aspect, we utilize siamese layers with shared parameters to learn cross-modal sentiment information. Furthermore, we use the polarity value to construct a relation graph for the mini-batch, which forms the continuous contrastive loss to acquire affective embeddings. Extensive experiments demonstrate that our proposed method performs favorably against state-of-the-art approaches. Our code is released on https://github.com/downdric/MSD.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127148329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Erudite Fine-Grained Visual Classification Model
Pub Date: 2023-06-01 | DOI: 10.1109/CVPR52729.2023.00702
Dongliang Chang, Yujun Tong, Ruoyi Du, Timothy M. Hospedales, Yi-Zhe Song, Zhanyu Ma
Current fine-grained visual classification (FGVC) models are isolated: in practice, we first need to identify the coarse-grained label of an object and then select the corresponding FGVC model for recognition. This hinders the application of FGVC algorithms in real-life scenarios. In this paper, we propose an erudite FGVC model jointly trained on several different datasets (here, different datasets refer to different fine-grained visual classification datasets), which can efficiently and accurately predict an object's fine-grained label across the combined label space. We found through a pilot study that positive and negative transfer co-occur when different datasets are mixed for training, i.e., the knowledge from other datasets is not always useful. Therefore, we first propose a feature disentanglement module and a feature re-fusion module to reduce negative transfer and boost positive transfer between different datasets. In detail, we reduce negative transfer by decoupling the deep features through multiple dataset-specific feature extractors; these are subsequently re-fused channel-wise to facilitate positive transfer. Finally, we propose a meta-learning-based, dataset-agnostic spatial attention layer to take full advantage of the multi-dataset training data, given that localisation is dataset-agnostic across datasets. Experimental results on 11 different mixed datasets built from four different FGVC datasets demonstrate the effectiveness of the proposed method. Furthermore, the proposed method can easily be combined with existing FGVC methods to obtain state-of-the-art results. Our code is available at https://github.com/PRIS-CV/An-Erudite-FGVC-Model.
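A minimal sketch of "decouple into dataset-specific extractors, then re-fuse channel-wise" on top of a shared backbone feature; the branch count, gating mechanism, and dimensions are illustrative assumptions, and the meta-learned spatial attention layer is omitted.

```python
import torch

class DisentangleRefuse(torch.nn.Module):
    """Sketch of dataset-specific decoupling followed by channel-wise re-fusion.

    Each dataset gets its own lightweight extractor on top of a shared backbone
    feature; a sigmoid gating vector then re-fuses the branches channel by channel.
    """
    def __init__(self, num_datasets=4, dim=512):
        super().__init__()
        self.branches = torch.nn.ModuleList(
            [torch.nn.Linear(dim, dim) for _ in range(num_datasets)])
        self.gate = torch.nn.Linear(dim, num_datasets * dim)

    def forward(self, shared_feat):                                             # (B, dim)
        specific = torch.stack([b(shared_feat) for b in self.branches], dim=1)  # (B, S, dim)
        gates = torch.sigmoid(self.gate(shared_feat)).view(
            shared_feat.size(0), len(self.branches), -1)                        # (B, S, dim)
        return (gates * specific).sum(dim=1)                                    # (B, dim)
```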
{"title":"An Erudite Fine-Grained Visual Classification Model","authors":"Dongliang Chang, Yujun Tong, Ruoyi Du, Timothy M. Hospedales, Yi-Zhe Song, Zhanyu Ma","doi":"10.1109/CVPR52729.2023.00702","DOIUrl":"https://doi.org/10.1109/CVPR52729.2023.00702","url":null,"abstract":"Current fine-grained visual classification (FGVC) models are isolated. In practice, we first need to identify the coarse-grained label of an object, then select the corresponding FGVC model for recognition. This hinders the application of FGVC algorithms in real-life scenarios. In this paper, we propose an erudite FGVC model jointly trained by several different datasets11In this paper, different datasets mean different fine-grained visual classification datasets., which can efficiently and accurately predict an object's fine-grained label across the combined label space. We found through a pilot study that positive and negative transfers co-occur when different datasets are mixed for training, i.e., the knowledge from other datasets is not always useful. Therefore, we first propose a feature disentanglement module and a feature re-fusion module to reduce negative transfer and boost positive transfer between different datasets. In detail, we reduce negative transfer by decoupling the deep features through many dataset-specific feature extractors. Subsequently, these are channel-wise re-fused to facilitate positive transfer. Finally, we propose a meta-learning based dataset-agnostic spatial attention layer to take full advantage of the multi-dataset training data, given that localisation is dataset-agnostic between different datasets. Experimental results across 11 different mixed-datasets built on four different FGVC datasets demonstrate the effectiveness of the proposed method. Furthermore, the proposed method can be easily combined with existing FGVC methods to obtain state-of-the-art results. Our code is available at https://github.com/PRIS-CV/An-Erudite-FGVC-Model.","PeriodicalId":376416,"journal":{"name":"2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127468580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}