
Latest Publications from the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Burst Reflection Removal using Reflection Motion Aggregation Cues
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00032
B. H. P. Prasad, S. GreenRoshK., R. Lokesh, K. Mitra
Single image reflection removal has attracted a lot of interest in the recent past, with data-driven approaches demonstrating significant improvements. However, deep learning based approaches for multi-image reflection removal remain relatively underexplored. The existing multi-image methods require input images to be captured at sufficiently different viewpoints with wide baselines. This makes it cumbersome for the user, who is required to capture the scene by moving the camera in multiple directions. A more convenient way is to capture a burst of images over a short time duration without providing any specific instructions to the user. A burst of images captured on a hand-held device provides crucial cues that rely on the subtle hand shake created during the capture process to separate the reflection and transmission layers. In this paper, we propose a multi-stage deep learning based approach for burst reflection removal. In the first stage, we perform reflection suppression on the individual images. In the second stage, a novel reflection motion aggregation (RMA) cue is extracted that emphasizes the transmission layer more than the reflection layer to aid better layer separation. In our final stage, we use this RMA cue as a guide to remove reflections from the input. We provide the first real-world burst image dataset, along with ground truth for reflection removal, that can enable future benchmarking. We evaluate both qualitatively and quantitatively to demonstrate the superiority of the proposed approach. Our method achieves a ~2 dB improvement in PSNR over single-image based methods and ~1 dB over multi-image based methods.
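The three-stage pipeline described in the abstract can be pictured as a simple forward pass. The sketch below is only an illustrative reading of that description, not the authors' released code; the module names (suppress_net, rma_extractor, removal_net) and the choice of the middle frame as the reference are assumptions.

```python
import torch
import torch.nn as nn

class BurstReflectionRemoval(nn.Module):
    """Illustrative three-stage pipeline: per-frame reflection suppression,
    reflection motion aggregation (RMA) cue extraction, and RMA-guided removal.
    The three sub-networks are placeholders, not the paper's architecture."""

    def __init__(self, suppress_net: nn.Module, rma_extractor: nn.Module,
                 removal_net: nn.Module):
        super().__init__()
        self.suppress_net = suppress_net    # stage 1: per-frame reflection suppression
        self.rma_extractor = rma_extractor  # stage 2: aggregate motion cues across the burst
        self.removal_net = removal_net      # stage 3: RMA-guided reflection removal

    def forward(self, burst: torch.Tensor) -> torch.Tensor:
        # burst: (B, N, 3, H, W) -- N frames captured in a short hand-held burst
        b, n, c, h, w = burst.shape
        # Stage 1: suppress reflections independently in each frame.
        suppressed = self.suppress_net(burst.reshape(b * n, c, h, w)).reshape(b, n, c, h, w)
        # Stage 2: one RMA cue per burst, emphasising the transmission layer.
        rma_cue = self.rma_extractor(suppressed)          # (B, 3, H, W)
        # Stage 3: remove reflections from a reference frame, guided by the cue.
        reference = burst[:, n // 2]                      # assume the middle frame as reference
        return self.removal_net(torch.cat([reference, rma_cue], dim=1))
```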
Citations: 1
Complementary Cues from Audio Help Combat Noise in Weakly-Supervised Object Detection
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00222
Cagri Gungor, Adriana Kovashka
We tackle the problem of learning object detectors in a noisy environment, which is one of the significant challenges for weakly-supervised learning. We use multimodal learning to help localize objects of interest, but unlike other methods, we treat audio as an auxiliary modality that helps tackle noise in detection from visual regions. First, we use the audio-visual model to generate new "ground-truth" labels for the training set to remove noise between the visual features and noisy supervision. Second, we propose an "indirect path" between audio and class predictions, which combines the link between visual and audio regions, and the link between visual features and predictions. Third, we propose a sound-based "attention path" which uses the benefit of complementary audio cues to identify important visual regions. We use contrastive learning to perform region-based audio-visual instance discrimination, which serves as an intermediate task and benefits from the complementary cues from audio to boost object classification and detection performance. We show that our methods, which update noisy ground truth and provide indirect and attention paths, greatly boost performance on the AudioSet and VGGSound datasets compared to single-modality predictions, even ones that use contrastive learning. Our method outperforms previous weakly-supervised detectors for the task of object detection by reaching the state of the art on AudioSet, and our sound localization module performs better than several state-of-the-art methods on AudioSet and MUSIC.
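As a rough illustration of the region-based audio-visual instance discrimination mentioned in the abstract, a standard InfoNCE-style contrastive loss between pooled visual-region features and audio features from the same clip might look like the following. This is a generic sketch under assumed shapes, not the authors' implementation, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(region_feats: torch.Tensor,
                         audio_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Generic audio-visual instance discrimination (InfoNCE) sketch.

    region_feats: (B, D) pooled visual-region embeddings, one per clip
    audio_feats:  (B, D) audio embeddings from the same clips
    Positive pairs are (i, i); all other clips in the batch act as negatives.
    """
    v = F.normalize(region_feats, dim=-1)
    a = F.normalize(audio_feats, dim=-1)
    logits = v @ a.t() / temperature                  # (B, B) similarity logits
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: match visual -> audio and audio -> visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```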
Citations: 1
MRI Imputation based on Fused Index- and Intensity-Registration
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00199
Jiyoon Shin, Jungwoo Lee
3D MRI is based on a number of imaging sequences such as T1, T2, T1ce, and FLAIR, and each of them is acquired as a group of two-dimensional scans. In practical MRI, some scans are often missing, while many medical applications require a full set of scans. An MRI imputation method is presented which synthesizes such missing scans. Key components of this method are the index registration and the intensity registration. The index registration models anatomical differences between two different scans in the same imaging sequence, and the intensity registration reflects the image contrast differences between two different scans of the same index. The two registration fields are learned to be invariant and accordingly allow two estimates of a missing scan, one within the corresponding imaging sequence and another along the scan index; the two estimates are combined to yield the final synthesized scan. Experimental results highlight that the proposed method improves on prevalent limitations of previous synthesis methods, blending both structural and contrast aspects and capturing subtle parts of the brain. Quantitative results also show its superiority across various data sets, transitions, and measures.
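A minimal sketch of the final fusion step described above: resample a source scan with a learned registration field, then combine the two resulting estimates of the missing scan. This is only one plausible reading of the abstract; the simple averaging, the 2D warping operator, and the argument names are assumptions rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def warp_2d(scan: torch.Tensor, offset: torch.Tensor) -> torch.Tensor:
    """Resample a scan (B, 1, H, W) with a dense offset field (B, H, W, 2)
    expressed in normalized [-1, 1] grid coordinates."""
    b, _, h, w = scan.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=scan.device),
                            torch.linspace(-1, 1, w, device=scan.device),
                            indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(scan, grid + offset, align_corners=True)

def fuse_estimates(estimate_along_index: torch.Tensor,
                   estimate_across_sequences: torch.Tensor) -> torch.Tensor:
    """Fuse the two registration-based estimates of the missing scan:
    one produced within the same imaging sequence (index registration) and
    one produced from the same index in another sequence (intensity
    registration). Simple averaging is an assumption for illustration."""
    return 0.5 * (estimate_along_index + estimate_across_sequences)
```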
Citations: 0
Real-time Concealed Weapon Detection on 3D Radar Images for Walk-through Screening System
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00074
Nagma S. Khan, Kazumine Ogura, E. Cosatto, Masayuki Ariyoshi
This paper presents a framework for real-time concealed weapon detection (CWD) on 3D radar images for walk-through screening systems. The walk-through screening system aims to ensure security in crowded areas by performing CWD on walking persons, hence it requires an accurate and real-time detection approach. To ensure accuracy, a weapon needs to be detected irrespective of its 3D orientation, thus we use the 3D radar images as detection input. To achieve real-time operation, we reformulate classic U-Net based segmentation networks to perform 3D detection tasks. Our 3D segmentation network predicts a peak-shaped probability map, instead of voxel-wise masks, to enable position inference by an elementary peak detection operation on the predicted map. In the peak-shaped probability map, the peak marks the weapon's position, so the weapon detection task translates to peak detection on the probability map. A Gaussian function is used to model weapons in the probability map. We experimentally validate our approach on realistic 3D radar images obtained from a walk-through weapon screening system prototype. Extensive ablation studies verify the effectiveness of our proposed approach over existing conventional approaches. The experimental results demonstrate that our proposed approach can perform accurate and real-time CWD, thus making it suitable for practical applications of walk-through screening.
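The peak-shaped probability map described in the abstract can be illustrated with a toy example: build a Gaussian target centred on the weapon position for training, and recover the position at inference with a simple argmax over the predicted map. This is a generic sketch of the idea, not the paper's code; the sigma and the detection threshold are assumed values.

```python
import torch

def gaussian_target(shape, center, sigma: float = 2.0) -> torch.Tensor:
    """Peak-shaped training target: a Gaussian bump at the weapon position.
    shape: (D, H, W) of the radar volume; center: (z, y, x) voxel coordinates."""
    zs, ys, xs = torch.meshgrid(*[torch.arange(s, dtype=torch.float32) for s in shape],
                                indexing="ij")
    dist2 = (zs - center[0]) ** 2 + (ys - center[1]) ** 2 + (xs - center[2]) ** 2
    return torch.exp(-dist2 / (2.0 * sigma ** 2))

def detect_peak(prob_map: torch.Tensor, threshold: float = 0.5):
    """Elementary peak detection: the global maximum marks the weapon;
    a peak below the threshold is treated as 'no weapon present'."""
    flat_idx = torch.argmax(prob_map)
    peak_val = prob_map.flatten()[flat_idx]
    if peak_val < threshold:
        return None
    # Convert the flat index back to (z, y, x) voxel coordinates.
    d, h, w = prob_map.shape
    z, rem = divmod(flat_idx.item(), h * w)
    y, x = divmod(rem, w)
    return (z, y, x), peak_val.item()
```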
Citations: 2
Meta-Auxiliary Learning for Future Depth Prediction in Videos
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00571
Huan Liu, Zhixiang Chi, Yuanhao Yu, Yang Wang, Jun Chen, Jingshan Tang
We consider a new problem of future depth prediction in videos. Given a sequence of observed frames in a video, the goal is to predict the depth map of a future frame that has not been observed yet. Depth estimation plays a vital role in scene understanding and decision-making in intelligent systems. Predicting future depth maps can be valuable for autonomous vehicles to anticipate the behaviours of their surrounding objects. Our proposed model for this problem has a two-branch architecture. One branch is for the primary task of future depth prediction. The other branch is for an auxiliary task of image reconstruction. The auxiliary branch can act as a regularizer. Inspired by recent work on test-time adaptation, we use the auxiliary task during testing to adapt the model to a specific test video. We also propose a novel meta-auxiliary learning scheme that trains the model specifically for effective test-time adaptation. Experimental results demonstrate that our proposed approach outperforms other alternative methods.
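A minimal sketch of the test-time adaptation idea described above: before predicting the future depth of a test video, take a few gradient steps on the self-supervised auxiliary (reconstruction) branch for that video, then run the primary branch. The method names on the model, the optimizer, the reconstruction loss, and the number of adaptation steps are assumptions, not the paper's settings.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_and_predict(model, frames: torch.Tensor,
                      steps: int = 5, lr: float = 1e-4) -> torch.Tensor:
    """Test-time adaptation sketch for future depth prediction.

    The model is assumed to expose two heads (hypothetical names):
      model.reconstruct(frames)  -> reconstructed frames   (auxiliary branch)
      model.predict_depth(frames) -> future depth map      (primary branch)
    frames: (1, T, 3, H, W) observed frames of a single test video.
    """
    adapted = copy.deepcopy(model)            # keep the meta-trained weights intact
    adapted.train()
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        recon = adapted.reconstruct(frames)
        loss = F.l1_loss(recon, frames)       # self-supervised: no labels needed at test time
        loss.backward()
        optimizer.step()
    adapted.eval()
    with torch.no_grad():
        return adapted.predict_depth(frames)  # depth map of the unobserved future frame
```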
Citations: 5
Relaxing Contrastiveness in Multimodal Representation Learning
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00226
Zudi Lin, Erhan Bas, Kunwar Yashraj Singh, Gurumurthy Swaminathan, Rahul Bhotika
Multimodal representation learning for images with paired raw texts can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which acts as a drop-in replacement for the widely used InfoNCE loss. The key insight of ReCo is to allow a relaxed negative space by not penalizing unpaired multimodal samples (i.e., negative pairs) that are already orthogonal or negatively correlated. Unlike the widely-used InfoNCE, which keeps repelling negative pairs as long as they are not anti-correlated, ReCo by design embraces more diversity and flexibility in the learned embeddings. We conduct extensive experiments using ReCo with state-of-the-art models by pretraining on the MIMIC-CXR dataset, which consists of chest radiographs and free-text radiology reports, and evaluating on the CheXpert dataset for multimodal retrieval and disease classification. Our ReCo achieves an absolute improvement of 2.9% over the InfoNCE baseline on the CheXpert Retrieval dataset in average retrieval precision and reports better or comparable performance in the linear evaluation and finetuning for classification. We further show that ReCo outperforms InfoNCE on the Flickr30K dataset by 1.7% in retrieval Recall@1, demonstrating the generalizability of our approach to natural images.
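One way to read the "relaxed negative space" described above is as an InfoNCE variant in which negative-pair similarities are clamped at zero, so pairs that are already orthogonal or negatively correlated contribute no further repulsive gradient. The sketch below reflects that reading only; the paper's exact formulation may differ, and the temperature is an assumption.

```python
import torch
import torch.nn.functional as F

def reco_style_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Relaxed-contrastive sketch: like InfoNCE, but negative-pair similarities
    are clamped at zero, so already orthogonal or negatively correlated pairs
    receive no repulsive gradient. Illustrative reading, not the paper's
    exact objective."""
    v = F.normalize(img_emb, dim=-1)
    t = F.normalize(txt_emb, dim=-1)
    sim = v @ t.t()                                   # (B, B) cosine similarities
    b = sim.size(0)
    eye = torch.eye(b, dtype=torch.bool, device=sim.device)
    # Keep positive (diagonal) similarities as-is; clamp negatives at zero.
    relaxed = torch.where(eye, sim, sim.clamp(min=0.0))
    logits = relaxed / temperature
    targets = torch.arange(b, device=sim.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```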
Citations: 0
Marker-removal Networks to Collect Precise 3D Hand Data for RGB-based Estimation and its Application in Piano
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00299
Erwin Wu, Hayato Nishioka, Shinichi Furuya, H. Koike
Hand pose analysis is a key step toward understanding the dexterous hand performance involved in many high-level skills, such as playing the piano. Currently, the most accurate hand-tracking systems use fabric-/marker-based sensing that potentially disturbs users' performance. On the other hand, markerless computer vision-based methods rely on a precise bare-hand dataset for training, which is difficult to obtain. In this paper, we collect a large-scale, high-precision 3D hand pose dataset with a small workload using a marker-removal network (MR-Net). The proposed MR-Net translates marked-hand images into realistic bare-hand images, and the corresponding 3D postures are captured by motion capture, so few manual annotations are required. A baseline estimation network, PiaNet, is introduced, and we report accuracy on various metrics together with a blind qualitative test to show the practical effect.
Citations: 0
Graph-Based Self-Learning for Robust Person Re-identification
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00477
Yuqiao Xian, Jinrui Yang, Fufu Yu, Jun Zhang, Xing Sun
Existing deep learning approaches for person re-identification (Re-ID) mostly rely on large-scale and well-annotated training data. However, human-annotated labels are prone to label noise in real-world applications. Previous person Re-ID works mainly focus on random label noise, which does not properly reflect the characteristics of label noise in the practical human annotation process. In this work, we find that visual ambiguity noise is a more common and reasonable noise assumption in person Re-ID annotation. To handle this kind of noise, we propose a simple and effective robust person Re-ID framework, namely Graph-Based Self-Learning (GBSL), to iteratively learn discriminative representations and rectify noisy labels with limited annotated samples for each identity. Meanwhile, considering the practical annotation process in person Re-ID, we further extend the visual ambiguity noise assumption and propose a more practical type of label noise in person Re-ID, namely tracklet-level label noise (TLN). Without modifying the network architecture or loss function, our approach significantly improves the robustness of the Re-ID system against label noise. Our model obtains competitive performance with training data corrupted by various types of label noise and outperforms existing methods for robust Re-ID on public benchmarks.
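As a generic illustration of graph-based label rectification of the kind described above, the sketch below relabels each sample by a similarity-weighted vote over its k nearest neighbours in feature space. This is a simplified stand-in, not the GBSL algorithm itself; the value of k and the voting rule are assumptions.

```python
import torch
import torch.nn.functional as F

def rectify_labels(features: torch.Tensor, noisy_labels: torch.Tensor,
                   num_classes: int, k: int = 10) -> torch.Tensor:
    """Graph-based label rectification sketch.

    features:     (N, D) embeddings extracted by the current Re-ID model
    noisy_labels: (N,)   possibly corrupted identity labels (long tensor)
    Each sample is relabelled by a similarity-weighted vote of its k nearest
    neighbours (excluding itself) on a cosine-similarity graph.
    """
    x = F.normalize(features, dim=-1)
    sim = x @ x.t()                                   # (N, N) cosine-similarity graph
    sim.fill_diagonal_(-1.0)                          # exclude self edges
    weights, idx = sim.topk(k, dim=1)                 # k strongest edges per node
    one_hot = F.one_hot(noisy_labels, num_classes).float()
    # Accumulate neighbour votes, weighted by edge strength (clamped at 0).
    votes = (weights.clamp(min=0).unsqueeze(-1) * one_hot[idx]).sum(dim=1)
    return votes.argmax(dim=1)                        # rectified labels
```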
Citations: 1
ProtoSeg: Interpretable Semantic Segmentation with Prototypical Parts
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00153
Mikolaj Sacha, Dawid Rymarczyk, Lukasz Struski, J. Tabor, Bartosz Zieli'nski
We introduce ProtoSeg, a novel model for interpretable semantic image segmentation, which constructs its predictions using similar patches from the training set. To achieve accuracy comparable to baseline methods, we adapt the mechanism of prototypical parts and introduce a diversity loss function that increases the variety of prototypes within each class. We show that ProtoSeg discovers semantic concepts, in contrast to standard segmentation models. Experiments conducted on Pascal VOC and Cityscapes datasets confirm the precision and transparency of the presented method.
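A plausible form of the within-class diversity loss mentioned in the abstract is a penalty on pairwise similarity between prototypes assigned to the same class, which pushes the prototypes toward distinct prototypical parts. The sketch below shows that generic form only and is not taken from the paper.

```python
import torch
import torch.nn.functional as F

def diversity_loss(prototypes: torch.Tensor) -> torch.Tensor:
    """Encourage variety among the prototypes of one class.

    prototypes: (P, D) -- the P prototype vectors assigned to a single class
    (assumes P >= 2). Penalizes the mean positive pairwise cosine similarity
    over off-diagonal entries, so prototypes are pushed apart.
    """
    p = F.normalize(prototypes, dim=-1)
    sim = p @ p.t()                                   # (P, P) pairwise similarities
    n = p.size(0)
    off_diag = sim - torch.eye(n, device=p.device)    # zero out the diagonal
    return off_diag.clamp(min=0).sum() / (n * (n - 1))
```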
Citations: 6
Dynamic Re-weighting for Long-tailed Semi-supervised Learning
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00640
Hanyu Peng, Weiguo Pian, Mingming Sun, P. Li
Semi-supervised learning (SSL) greatly reduces the need for human annotation by demanding only a small number of labelled samples alongside a large number of unlabelled samples. The research community has mostly developed SSL under the assumption of a balanced data set; in contrast, real data is often imbalanced or even long-tailed. The need to study SSL under imbalance is therefore critical. In this paper, we essentially extend FixMatch (an SSL method) to the imbalanced case. We find that the unlabeled data is also highly imbalanced during the training process; in this respect, we propose a re-weighting solution based on the effective number of samples. Furthermore, since prediction uncertainty leads to temporal variations in the number of pseudo-labels, we propose a dynamic re-weighting scheme for the unlabeled data. The simplicity and validity of our method are backed up by experimental evidence. In particular, on the CIFAR-10, CIFAR-100, and ImageNet127 data sets, our approach provides the strongest results compared with previous methods across various scales of imbalance.
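The "effective number" re-weighting mentioned above typically follows the class-balanced weighting used in long-tailed learning, where a class with n samples has effective number E_n = (1 - beta^n) / (1 - beta) and receives a weight proportional to 1 / E_n; in the semi-supervised setting the per-class counts would have to be re-estimated from pseudo-labels as training progresses. The sketch below shows that standard formulation with an assumed beta; it is not the paper's exact scheme.

```python
import torch

def effective_number_weights(class_counts: torch.Tensor,
                             beta: float = 0.999) -> torch.Tensor:
    """Per-class weights from the 'effective number of samples'.

    class_counts: (C,) number of (pseudo-)labelled samples per class; in SSL
    these counts would be re-estimated from pseudo-labels over time, which is
    the dynamic part of the re-weighting scheme.
    Effective number: E_n = (1 - beta**n) / (1 - beta); weight ~ 1 / E_n.
    """
    counts = class_counts.clamp(min=1).float()
    effective_num = (1.0 - torch.pow(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * counts.numel() / weights.sum()   # normalize to mean weight 1
```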
Citations: 1