Leaky Gated Cross-Attention for Weakly Supervised Multi-Modal Temporal Action Localization
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00089
Jun-Tae Lee, Sungrack Yun, Mihir Jain
As multiple modalities sometimes have a weak complementary relationship, multi-modal fusion is not always beneficial for weakly supervised action localization. Hence, to attain adaptive multi-modal fusion, we propose a leaky gated cross-attention mechanism. In our work, we take multi-stage cross-attention as the baseline fusion module to obtain multi-modal features. Then, for the stages of each modality, we design gates that decide the dependency on the other modality. For each input frame, if the two modalities have a strong complementary relationship, the gate selects the cross-attended feature; otherwise, it selects the non-attended feature. In addition, the proposed gate allows the non-selected feature to escape through it with a small intensity, hence we call it a leaky gate. This leaked feature effectively regularizes the selected major feature. Therefore, our leaky gating makes cross-attention more adaptable and robust even when the modalities have a weak complementary relationship. The proposed leaky gated cross-attention provides a modality-fusion module that is broadly compatible with various temporal action localization methods. To show its effectiveness, we conduct extensive experimental analysis and apply the proposed method to boost the performance of state-of-the-art methods on two benchmark datasets (ActivityNet1.2 and THUMOS14).
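The leaky gate admits a compact sketch. The following is a minimal, hypothetical PyTorch rendering of one plausible reading of the abstract: a sigmoid gate chooses between the cross-attended and non-attended feature per frame, while a small leak coefficient lets the non-selected branch pass through. The gate network, the `eps` value, and the tensor shapes are assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class LeakyGatedFusion(nn.Module):
    """Sketch of a per-frame leaky gate between cross-attended and non-attended features."""

    def __init__(self, dim: int, eps: float = 0.1):
        super().__init__()
        # Scalar gate per frame, computed from both feature versions (assumed design).
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.eps = eps  # leak intensity for the non-selected feature (assumed value)

    def forward(self, x_self: torch.Tensor, x_cross: torch.Tensor) -> torch.Tensor:
        # x_self: non-attended (uni-modal) features, x_cross: cross-attended features,
        # both of shape (batch, frames, dim).
        g = self.gate(torch.cat([x_self, x_cross], dim=-1))  # (batch, frames, 1)
        selected = g * x_cross + (1.0 - g) * x_self           # gate picks one branch
        leaked = g * x_self + (1.0 - g) * x_cross             # the branch it did not pick
        return selected + self.eps * leaked                   # leak regularizes the selection

fused = LeakyGatedFusion(dim=256)(torch.rand(2, 16, 256), torch.rand(2, 16, 256))
```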
{"title":"Leaky Gated Cross-Attention for Weakly Supervised Multi-Modal Temporal Action Localization","authors":"Jun-Tae Lee, Sungrack Yun, Mihir Jain","doi":"10.1109/WACV51458.2022.00089","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00089","url":null,"abstract":"As multiple modalities sometimes have a weak complementary relationship, multi-modal fusion is not always beneficial for weakly supervised action localization. Hence, to attain the adaptive multi-modal fusion, we propose a leaky gated cross-attention mechanism. In our work, we take the multi-stage cross-attention as the baseline fusion module to obtain multi-modal features. Then, for the stages of each modality, we design gates to decide the dependency on the other modality. For each input frame, if two modalities have a strong complementary relationship, the gate selects the cross-attended feature, otherwise the non-attended feature. Also, the proposed gate allows the non-selected feature to escape through it with a small intensity, we call it leaky gate. This leaky feature makes effective regularization of the selected major feature. Therefore, our leaky gating makes cross-attention more adaptable and robust even when the modalities have a weak complementary relationship. The proposed leaky gated cross-attention provides a modality fusion module that is generally compatible with various temporal action localization methods. To show its effectiveness, we do extensive experimental analysis and apply the proposed method to boost the performance of the state-of-the-art methods on two benchmark datasets (ActivityNet1.2 and THUMOS14).","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128614798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated Defect Inspection in Reverse Engineering of Integrated Circuits
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00187
A. Bette, Patrick Brus, G. Balázs, Matthias Ludwig, Alois Knoll
In the semiconductor industry, reverse engineering is used to extract information from microchips. Circuit extraction is becoming increasingly difficult due to continuous technology shrinking. A high-quality reverse engineering process is challenged by various defects arising from chip preparation and imaging errors. Currently, no automated, technology-agnostic defect inspection framework is available. To meet the requirements of the mostly manual reverse engineering process, the proposed automated framework needs to handle highly imbalanced data, as well as unknown and multiple defect classes. We propose a network architecture that is composed of a shared Xception-based feature extractor and multiple, individually trainable binary classification heads: the HydREnet. We evaluated our defect classifier on three challenging industrial datasets and achieved accuracies of over 85%, even for underrepresented classes. With this framework, the manual inspection effort can be reduced to 5%.
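The shared-backbone, multi-head idea can be sketched as follows. This is a minimal stand-in, assuming a torchvision ResNet-18 in place of the paper's Xception extractor and an illustrative head count; it is not the HydREnet implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiHeadBinaryClassifier(nn.Module):
    """Shared feature extractor with individually trainable binary heads (HydREnet-style sketch)."""

    def __init__(self, num_defect_classes: int = 5):
        super().__init__()
        backbone = models.resnet18(weights=None)   # stand-in for the Xception backbone
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()                # keep only pooled features
        self.backbone = backbone
        # One binary head per defect class; training head k alone is just a matter
        # of building the optimizer over self.heads[k].parameters().
        self.heads = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_defect_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                               # (batch, feat_dim)
        return torch.cat([head(feats) for head in self.heads], dim=1)  # (batch, num_classes)

logits = MultiHeadBinaryClassifier()(torch.rand(2, 3, 224, 224))  # -> shape (2, 5)
```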
{"title":"Automated Defect Inspection in Reverse Engineering of Integrated Circuits","authors":"A. Bette, Patrick Brus, G. Balázs, Matthias Ludwig, Alois Knoll","doi":"10.1109/WACV51458.2022.00187","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00187","url":null,"abstract":"In the semiconductor industry, reverse engineering is used to extract information from microchips. Circuit extraction is becoming increasingly difficult due to the continuous technology shrinking. A high quality reverse engineering process is challenged by various defects coming from chip preparation and imaging errors. Currently, no automated, technology-agnostic defect inspection framework is available. To meet the requirements of the mostly manual reverse engineering process, the proposed automated frame- work needs to handle highly imbalanced data, as well as unknown and multiple defect classes. We propose a network architecture that is composed of a shared Xception- based feature extractor and multiple, individually trainable binary classification heads: the HydREnet. We evaluated our defect classifier on three challenging industrial datasets and achieved accuracies of over 85 %, even for underrepresented classes. With this framework, the manual inspection effort can be reduced down to 5 %.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124614630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Context-enriched Satellite Imagery Dataset and an Approach for Parking Lot Detection
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00146
Yifang Yin, Wenmiao Hu, An Tran, H. Kruppa, Roger Zimmermann, See-Kiong Ng
Automatic detection of geoinformation from satellite images has been a fundamental yet challenging problem, which aims to reduce the manual effort of human annotators in maintaining an up-to-date digital map. Several high-resolution satellite imagery datasets are currently publicly available. However, the associated ground-truth annotations are limited to roads, buildings, and land use, while annotations of other geographic objects or attributes are mostly unavailable. To bridge this gap, we present Grab-Pklot, the first high-resolution, context-enriched satellite imagery dataset for parking lot detection. Our dataset consists of 1344 satellite images with ground-truth annotations of carparks in Singapore. Motivated by the observation that carparks mostly co-appear with other geographic objects, we associate each satellite image in our dataset with the surrounding contextual information of roads and buildings, given in the format of multi-channel images. As a side contribution, we present a fusion-based segmentation approach to demonstrate that parking lot detection accuracy can be improved by modeling the correlations between parking lots and other geographic objects. Experiments on our dataset provide baseline results as well as new insights into the challenges and opportunities in parking lot detection from satellite images.
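To make the multi-channel context idea concrete, here is a hypothetical sketch of stacking road and building context masks with the RGB satellite tile before a segmentation network; the 5-channel layout and the channel order are assumptions, not the dataset's documented format.

```python
import torch
import torch.nn as nn

# Hypothetical fusion-by-concatenation: RGB satellite tile plus road and
# building context masks form a 5-channel input to a segmentation network.
rgb = torch.rand(1, 3, 512, 512)        # satellite image
road_mask = torch.rand(1, 1, 512, 512)  # rasterized road context
bldg_mask = torch.rand(1, 1, 512, 512)  # rasterized building context

x = torch.cat([rgb, road_mask, bldg_mask], dim=1)  # (1, 5, 512, 512)

# Any encoder-decoder segmenter works; only the first conv needs 5 input channels.
first_conv = nn.Conv2d(in_channels=5, out_channels=64, kernel_size=3, padding=1)
features = first_conv(x)
print(features.shape)  # torch.Size([1, 64, 512, 512])
```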
{"title":"A Context-enriched Satellite Imagery Dataset and an Approach for Parking Lot Detection","authors":"Yifang Yin, Wenmiao Hu, An Tran, H. Kruppa, Roger Zimmermann, See-Kiong Ng","doi":"10.1109/WACV51458.2022.00146","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00146","url":null,"abstract":"Automatic detection of geoinformation from satellite images has been a fundamental yet challenging problem, which aims to reduce the manual effort of human annotators in maintaining an up-to-date digital map. There are currently several high-resolution satellite imagery datasets that are publicly available. However, the associated ground-truth annotations are limited to road, building, and land use, while the annotations of other geographic objects or attributes are mostly not available. To bridge the gap, we present Grab-Pklot, the first high-resolution and context-enriched satellite imagery dataset for parking lot detection. Our dataset consists of 1344 satellite images with the ground-truth annotations of carparks in Singapore. Motivated by the observation that carparks are mostly co-appear with other geographic objects, we associate each satellite image in our dataset with the surrounding contextual information of road and building, given in the format of multi-channel images. As a side contribution, we present a fusion-based segmentation approach to demonstrate that the parking lot detection accuracy can be improved by modeling the correlations between parking lots and other geographic objects. Experiments on our dataset provide baseline results as well as new insights into the challenges and opportunities in parking lot detection from satellite images.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121110556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TypeNet: Towards Camera Enabled Touch Typing on Flat Surfaces through Self-Refinement
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00064
Ben Maman, Amit H. Bermano
Text entry for mobile devices is nowadays a crucial yet time-consuming task, with no practical solution available for natural typing speeds without extra hardware. In this paper, we introduce a real-time method that is a significant step towards enabling touch typing on arbitrary flat surfaces (e.g., tables). The method employs only a simple video camera, placed in front of the user on the flat surface, at an angle practical for mobile usage. To achieve this, we adopt a classification framework, based on the observation that, in touch typing, similar hand configurations imply the same typed character across users. Importantly, this approach allows the convenience of uncalibrated typing, where the hand positions, with respect to the camera and each other, are not dictated. To improve accuracy, we propose a language processing scheme that corrects the typed text and is specifically designed for real-time performance and integration with the vision-based signal. To enable feasible data collection and training, we propose a self-refinement approach that allows training on unlabeled flat-surface-typing footage: a network trained on (labeled) keyboard footage labels flat-surface videos using dynamic time warping, and is then trained on them in an Expectation-Maximization (EM) manner. Using these techniques, we introduce the TypingHands26 Dataset, comprising videos of 26 different users typing on a keyboard and 10 users typing on a flat surface, labeled at the frame level. We validate our approach and present a single-camera-based system with a character-level accuracy of 93.5% on average for known users and 85.7% for unknown ones, outperforming pose-estimation-based methods by a large margin despite operating at natural typing speeds of up to 80 words per minute. Our method is the first to rely on a simple camera alone, and it runs at interactive speeds while still maintaining accuracy comparable to systems employing non-commodity equipment.
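The EM-style self-refinement can be summarized as a pseudo-labeling loop. The sketch below is schematic: `dtw_transfer_labels` is a hypothetical placeholder standing in for the dynamic-time-warping alignment between keyboard-labeled and flat-surface footage, and the loop structure is an assumption, not the authors' exact recipe.

```python
import torch

def dtw_transfer_labels(model, clip):
    # Hypothetical stand-in: the paper aligns flat-surface clips to labeled
    # keyboard sequences with dynamic time warping and transfers the key labels;
    # here we simply reuse the model's own frame-wise predictions.
    with torch.no_grad():
        return model(clip).argmax(dim=-1)

def self_refine(model, keyboard_loader, flat_surface_clips, optimizer, criterion, rounds=3):
    """Schematic EM-style self-refinement loop (illustrative only)."""
    for _ in range(rounds):
        # E-step: pseudo-label the unlabeled flat-surface footage.
        model.eval()
        pseudo = [(clip, dtw_transfer_labels(model, clip)) for clip in flat_surface_clips]

        # M-step: retrain on labeled keyboard data plus the pseudo-labeled clips.
        model.train()
        for frames, labels in list(keyboard_loader) + pseudo:
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model
```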
{"title":"TypeNet: Towards Camera Enabled Touch Typing on Flat Surfaces through Self-Refinement","authors":"Ben Maman, Amit H. Bermano","doi":"10.1109/WACV51458.2022.00064","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00064","url":null,"abstract":"Text entry for mobile devices nowadays is an equally crucial and time-consuming task, with no practical solution available for natural typing speeds without extra hardware. In this paper, we introduce a real-time method that is a significant step towards enabling touch typing on arbitrary flat surfaces (e.g., tables). The method employs only a simple video camera, placed in front of the user on the flat surface — at an angle practical for mobile usage. To achieve this, we adopt a classification framework, based on the observation that, in touch typing, similar hand configurations imply the same typed character across users. Importantly, this approach allows the convenience of un-calibrated typing, where the hand positions, with respect to the camera and each other, are not dictated.To improve accuracy, we propose a Language Processing scheme, which corrects the typed text and is specifically designed for real-time performance and integration with the vision-based signal. To enable feasible data collection and training, we propose a self-refinement approach that allows training on unlabeled flat-surface-typing footage; A network trained on (labeled) keyboard footage labels flat-surface videos using dynamic time warping, and is trained on them, in an Expectation Maximization (EM) manner.Using these techniques, we introduce the TypingHands26 Dataset, comprising videos of 26 different users typing on a keyboard, and 10 users typing on a flat surface, labeled at the frame level. We validate our approach and present a single camera-based system with character-level accuracy of 93.5% on average for known users, and 85.7% for unknown ones, outperforming pose-estimation-based methods by a large margin, despite performing at natural typing speeds of up to 80 Words Per Minute. Our method is the first to rely on a simple camera alone, and runs in interactive speeds, while still maintaining accuracy comparable to systems employing non-commodity equipment.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"583 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116176071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tailor Me: An Editing Network for Fashion Attribute Shape Manipulation
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00320
Youngjoon Kwon, Stefano Petrangeli, Dahun Kim, Haoliang Wang, Viswanathan Swaminathan, H. Fuchs
Fashion attribute editing aims to manipulate fashion images based on a user-specified attribute while preserving the details of the original image as intact as possible. Recent works in this domain have mainly focused on direct manipulation of the raw RGB pixels, which only allows edits involving relatively small shape changes (e.g., sleeves). The goal of our Virtual Personal Tailoring Network (VPTNet) is to extend these editing capabilities to much larger shape changes of fashion items, such as cloth length. To achieve this goal, we decouple the fashion attribute editing task into two conditional stages: shape-then-appearance editing. To this end, we first propose a shape editing network that employs a semantic parsing of the fashion image as an interface for manipulation. Compared to operating on the raw RGB image, editing the parsing map enables more complex shape editing operations. Second, we introduce an appearance completion network that takes the results of the previous stage and completes the shape-difference regions to produce the final RGB image. Qualitative and quantitative experiments on the DeepFashion-Synthesis dataset confirm that VPTNet outperforms state-of-the-art methods for both small and large shape attribute editing.
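The shape-then-appearance decomposition can be illustrated as two chained modules. The interfaces below (a parsing-map editor conditioned on the attribute, followed by an appearance completion step) are hypothetical placeholders with single convolutions standing in for full conditional generators; they are not the VPTNet architecture.

```python
import torch
import torch.nn as nn

class ShapeThenAppearance(nn.Module):
    """Two-stage sketch: edit the semantic parsing map, then fill in appearance."""

    def __init__(self, num_parse_classes: int = 20, attr_dim: int = 8):
        super().__init__()
        # Stage 1: predicts an edited parsing map from parsing map + attribute (placeholder).
        self.shape_editor = nn.Conv2d(num_parse_classes + attr_dim,
                                      num_parse_classes, kernel_size=3, padding=1)
        # Stage 2: completes RGB appearance from image + edited parsing map (placeholder).
        self.appearance_net = nn.Conv2d(3 + num_parse_classes, 3,
                                        kernel_size=3, padding=1)

    def forward(self, image, parsing, attribute):
        # Broadcast the attribute vector to a spatial map before concatenation.
        attr_map = attribute[:, :, None, None].expand(-1, -1, *parsing.shape[-2:])
        edited_parsing = self.shape_editor(torch.cat([parsing, attr_map], dim=1))
        rgb = self.appearance_net(torch.cat([image, edited_parsing], dim=1))
        return rgb, edited_parsing

rgb, parse = ShapeThenAppearance()(torch.rand(1, 3, 128, 128),
                                   torch.rand(1, 20, 128, 128),
                                   torch.rand(1, 8))
```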
{"title":"Tailor Me: An Editing Network for Fashion Attribute Shape Manipulation","authors":"Youngjoon Kwon, Stefano Petrangeli, Dahun Kim, Haoliang Wang, Viswanathan Swaminathan, H. Fuchs","doi":"10.1109/WACV51458.2022.00320","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00320","url":null,"abstract":"Fashion attribute editing aims to manipulate fashion images based on a user-specified attribute, while preserving the details of the original image as intact as possible. Recent works in this domain have mainly focused on direct manipulation of the raw RGB pixels, which only allows to perform edits involving relatively small shape changes (e.g., sleeves). The goal of our Virtual Personal Tailoring Network (VPTNet) is to extend the editing capabilities to much larger shape changes of fashion items, such as cloth length. To achieve this goal, we decouple the fashion attribute editing task into two conditional stages: shape-then-appearance editing. To this aim, we propose a shape editing network that employs a semantic parsing of the fashion image as an interface for manipulation. Compared to operating on the raw RGB image, our parsing map editing enables performing more complex shape editing operations. Second, we introduce an appearance completion network that takes the previous stage results and completes the shape difference regions to produce the final RGB image. Qualitative and quantitative experiments on the DeepFashion-Synthesis dataset confirm that VPTNet outperforms state-of-the-art methods for both small and large shape attribute editing.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"104 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124130891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Symmetric-light Photometric Stereo
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00039
Kazuma Minami, Hiroaki Santo, Fumio Okura, Y. Matsushita
This paper presents symmetric-light photometric stereo for surface normal estimation, in which directional lights are distributed symmetrically with respect to the optic center. Unlike previous studies of ring-light settings that require knowledge of the ring radius, we show that, even without knowing the exact light source locations or their distances from the optic center, the symmetric configuration provides sufficient information for recovering unique surface normals without ambiguity. Specifically, under the symmetric lights, measurements of a pair of scene points having distinct surface normals but the same albedo yield a system of constrained quadratic equations in the surface normals, which has a unique solution. Experiments demonstrate that the proposed method alleviates the need for geometric light source calibration while maintaining the accuracy of calibrated photometric stereo.
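For context, the origin of such pairwise constraints can be sketched from the standard Lambertian model; the derivation below is a generic illustration of how equal-albedo point pairs eliminate the albedo, not the paper's symmetric-light derivation, and the unit-norm constraints are the usual ones assumed in photometric stereo.

```latex
m_{p,j} = \rho\,\mathbf{n}_p^{\top}\mathbf{l}_j, \qquad
m_{q,j} = \rho\,\mathbf{n}_q^{\top}\mathbf{l}_j
\;\;\Longrightarrow\;\;
m_{q,j}\,\mathbf{n}_p^{\top}\mathbf{l}_j \;-\; m_{p,j}\,\mathbf{n}_q^{\top}\mathbf{l}_j = 0,
\qquad \|\mathbf{n}_p\| = \|\mathbf{n}_q\| = 1 .
```

Collecting these bilinear relations over all lights, together with the unit-norm constraints, gives a system of constrained quadratic equations in the two normals; per the abstract, the symmetric placement of the lights is what makes this system uniquely solvable without knowing the exact light positions.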
{"title":"Symmetric-light Photometric Stereo","authors":"Kazuma Minami, Hiroaki Santo, Fumio Okura, Y. Matsushita","doi":"10.1109/WACV51458.2022.00039","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00039","url":null,"abstract":"This paper presents symmetric-light photometric stereo for surface normal estimation, in which directional lights are distributed symmetrically with respect to the optic center. Unlike previous studies of ring-light settings that required the information of ring radius, we show that even without the knowledge of the exact light source locations or their distances from the optic center, the symmetric configuration provides us sufficient information for recovering unique surface normals without ambiguity. Specifically, under the symmetric lights, measurements of a pair of scene points having distinct surface normals but the same albedo yield a system of constrained quadratic equations about the surface normal, which has a unique solution. Experiments demonstrate that the proposed method alleviates the need for geometric light source calibration while maintaining the accuracy of calibrated photometric stereo.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126546629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Co-Segmentation Aided Two-Stream Architecture for Video Captioning
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00250
Jayesh Vaidya, Arulkumar Subramaniam, Anurag Mittal
The goal of video captioning is to generate captions for a video by understanding visual and temporal cues. A typical video captioning model follows an encoder-decoder framework, where the encoder captures visual and temporal information and the decoder generates captions. Recent works have incorporated object-level information into the encoder via a pretrained, off-the-shelf object detector, significantly improving performance. However, using an object detector comes with the following downsides: 1) object detectors may not exhaustively capture all the object categories; 2) in a realistic setting, performance may be affected by the domain gap between the object detector and the visual-captioning dataset. To remedy this, we argue that an external object detector can be eliminated if the model is equipped with the capability of automatically finding salient regions. To achieve this, we propose a novel architecture that learns to attend to salient regions, such as objects and persons, automatically using a co-segmentation-inspired attention module. Then, we utilize a novel salient-region interaction module to promote information propagation between salient regions of adjacent frames. Further, we incorporate this salient region-level information into the model using knowledge distillation. We evaluate our model on two benchmark datasets, MSR-VTT and MSVD, and show that it achieves competitive performance without using any object detector.
Resource-efficient Hybrid X-formers for Vision
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00361
Pranav Jeevan, A. Sethi
Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computation to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Moreover, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. Firstly, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (where X ∈ {Performer, Linformer, Nyströmformer}), thereby creating Vision X-formers (ViXs). This resulted in up to a sevenfold reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Secondly, we introduced an inductive prior for images by replacing the initial linear embedding layer with convolutional layers in ViX, which significantly increased classification accuracy without increasing the model size. Thirdly, we replaced the learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE), which increased classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.
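The second modification, replacing the linear patch embedding with convolutions, is easy to sketch. The channel counts, kernel sizes, and strides below are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ConvPatchEmbedding(nn.Module):
    """Convolutional stem producing a token sequence for a transformer.

    Replaces ViT's single linear projection of flattened patches with a small
    stack of convolutions, injecting a locality prior before attention.
    """

    def __init__(self, in_channels: int = 3, embed_dim: int = 128):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.stem(x)                     # (batch, embed_dim, H/4, W/4)
        return feats.flatten(2).transpose(1, 2)  # (batch, num_tokens, embed_dim)

tokens = ConvPatchEmbedding()(torch.rand(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 64, 128])
```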
{"title":"Resource-efficient Hybrid X-formers for Vision","authors":"Pranav Jeevan, A. Sethi","doi":"10.1109/WACV51458.2022.00361","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00361","url":null,"abstract":"Although transformers have become the neural architectures of choice for natural language processing, they require orders of magnitude more training data, GPU memory, and computations in order to compete with convolutional neural networks for computer vision. The attention mechanism of transformers scales quadratically with the length of the input sequence, and unrolled images have long sequence lengths. Plus, transformers lack an inductive bias that is appropriate for images. We tested three modifications to vision transformer (ViT) architectures that address these shortcomings. Firstly, we alleviate the quadratic bottleneck by using linear attention mechanisms, called X-formers (such that, X ∈{Performer, Linformer, Nyströmformer}), thereby creating Vision X-formers (ViXs). This resulted in up to a seven times reduction in the GPU memory requirement. We also compared their performance with FNet and multi-layer perceptron mixers, which further reduced the GPU memory requirement. Secondly, we introduced an inductive prior for images by replacing the initial linear embedding layer by convolutional layers in ViX, which significantly increased classification accuracy without increasing the model size. Thirdly, we replaced the learnable 1D position embeddings in ViT with Rotary Position Embedding (RoPE), which increases the classification accuracy for the same model size. We believe that incorporating such changes can democratize transformers by making them accessible to those with limited data and computing resources.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125937198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile based Human Identification using Forehead Creases: Application and Assessment under COVID-19 Masked Face Scenarios
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00128
Rohith J Bharadwaj, Gaurav Jaswal, A. Nigam, Kamlesh Tiwari
In the COVID-19 situation, face masks have become an essential part of our daily life. As a mask occludes the most prominent facial characteristics, it brings new challenges to existing facial recognition systems. This paper presents the idea of considering forehead creases (under a surprise facial expression) as a new biometric modality to authenticate mask-wearing faces. The forehead biometric utilizes the creases and textural skin patterns appearing due to voluntary contraction of the forehead region as features. The proposed framework is an efficient and generalizable deep learning framework for forehead recognition. Face-selfie images are collected using a smartphone's front camera in unconstrained settings spanning various realistic indoor/outdoor environments. Acquired forehead images are first passed through a segmentation model that produces rectangular Regions Of Interest (ROIs). A set of convolutional feature maps is subsequently obtained using a backbone network. The primary embeddings are enriched using a dual attention network (DANet) to induce discriminative feature learning. The attention-empowered embeddings are then optimized using Large Margin Cosine Loss (LMCL) followed by Focal Loss to update the weights, inducing robust training and better feature-discriminating capabilities. Our system is end-to-end and few-shot; thus, it is very efficient in memory requirements and recognition rate. Besides, we present a forehead image dataset (the BITS-IITMandi-ForeheadCreases Images Database) recorded in two sessions from 247 subjects, containing a total of 4,964 selfie face-mask images. To the best of our knowledge, this is the first mobile-based forehead dataset to date, and it is being made available along with the mobile application in the public domain. The proposed system achieves high performance in both closed-set matching (CRR of 99.08% and EER of 0.44%) and open-set matching (CRR of 97.84% and EER of 12.40%), which justifies the significance of using the forehead as a biometric modality.
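The LMCL term follows the standard large-margin cosine (CosFace-style) formulation; the sketch below shows that generic form with assumed scale and margin values, not the authors' tuned settings or their combination with Focal Loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LargeMarginCosineLoss(nn.Module):
    """CosFace-style LMCL sketch: cosine logits with a margin on the target class.

    Scale `s` and margin `m` are assumed values for illustration.
    """

    def __init__(self, feat_dim: int, num_classes: int, s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        margin = F.one_hot(labels, cosine.size(1)).float() * self.m
        logits = self.s * (cosine - margin)  # subtract the margin only on the target class
        return F.cross_entropy(logits, labels)

# Example: 247 identities as in the dataset described above (sizes otherwise assumed).
loss = LargeMarginCosineLoss(feat_dim=128, num_classes=247)(
    torch.randn(4, 128), torch.randint(0, 247, (4,)))
```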
{"title":"Mobile based Human Identification using Forehead Creases: Application and Assessment under COVID-19 Masked Face Scenarios","authors":"Rohith J Bharadwaj, Gaurav Jaswal, A. Nigam, Kamlesh Tiwari","doi":"10.1109/WACV51458.2022.00128","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00128","url":null,"abstract":"In the COVID-19 situation, face masks have become an essential part of our daily life. As mask occludes most prominent facial characteristics, it brings new challenges to the existing facial recognition systems. This paper presents an idea to consider forehead creases (under surprise facial expression) as a new biometric modality to authenticate mask-wearing faces. The forehead biometrics utilizes the creases and textural skin patterns appearing due to voluntary contraction of the forehead region as features. The proposed framework is an efficient and generalizable deep learning framework for forehead recognition. Face-selfie images are collected using smartphone’s frontal camera in an unconstrained environment with various indoor/outdoor realistic environments. Acquired forehead images are first subjected to a segmentation model that results in rectangular Region Of Interest (ROI’s). A set of convolutional feature maps are subsequently obtained using a backbone network. The primary embeddings are enriched using a dual attention network (DANet) to induce discriminative feature learning. The attention-empowered embeddings are then optimized using Large Margin Co-sine Loss (LMCL) followed by Focal Loss to update weights for inducting robust training and better feature discriminating capabilities. Our system is end-to-end and few-shot; thus, it is very efficient in memory requirements and recognition rate. Besides, we present a forehead image dataset (BITS-IITMandi-ForeheadCreases Images Database 1) that has been recorded in two sessions from 247 subjects containing a total of 4,964 selfie-face mask images. To the best of our knowledge, this is the first to date mobile-based fore-head dataset and is being made available along with the mobile application in the public domain. The proposed system has achieved high performance results in both closed-set, i.e., CRR of 99.08% and EER of 0.44% and open-set matching, i.e., CRR: 97.84%, EER: 12.40% which justifies the significance of using forehead as a biometric modality.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132176398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Semi-supervised Generalized VAE Framework for Abnormality Detection using One-Class Classification
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00137
Renuka Sharma, Satvik Mashkaria, Suyash P. Awate
Abnormality detection is a one-class classification (OCC) problem in which methods learn either a generative model of the inlier class (e.g., in the variants of kernel principal component analysis) or a decision boundary that encapsulates the inlier class (e.g., in the one-class variants of the support vector machine). Learning schemes for OCC typically train on data solely from the inlier class, but some recent OCC methods have proposed semi-supervised extensions that also leverage a small amount of training data from outlier classes. Other recent methods extend existing principles to employ deep neural network (DNN) models that learn, for the inlier class, either latent-space distributions or autoencoders, but not both. We propose a semi-supervised variational formulation leveraging generalized-Gaussian (GG) models, leading to data-adaptive, robust, and uncertainty-aware distribution modeling in both latent space and image space. We propose a reparameterization for sampling from the latent-space GG to enable backpropagation-based optimization. Results on many publicly available real-world image sets and a synthetic image set show the benefits of our method over existing methods.
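For intuition on GG sampling, here is one classical construction via a Gamma variate, using PyTorch's reparameterized Gamma sampler; the paper proposes its own reparameterization, which may differ, and the shape/scale conventions below are assumptions.

```python
import torch
from torch.distributions import Gamma

def sample_generalized_gaussian(mu: torch.Tensor, alpha: torch.Tensor,
                                beta: torch.Tensor) -> torch.Tensor:
    """Draw samples from a generalized Gaussian GG(mu, alpha, beta).

    Classical construction: if G ~ Gamma(1/beta, 1) and S is a random sign,
    then mu + alpha * S * G**(1/beta) has density proportional to
    exp(-(|x - mu| / alpha)**beta). torch.distributions.Gamma supports
    rsample(), so gradients flow via implicit reparameterization; this is one
    possible scheme, not necessarily the one proposed in the paper.
    """
    g = Gamma(concentration=1.0 / beta, rate=torch.ones_like(beta)).rsample()
    sign = 2.0 * torch.bernoulli(torch.full_like(mu, 0.5)) - 1.0
    return mu + alpha * sign * g.pow(1.0 / beta)

# Example: a batch of latent vectors with per-dimension shape parameters.
mu = torch.zeros(4, 8)
alpha = torch.ones(4, 8)
beta = torch.full((4, 8), 1.5)  # beta=2 recovers a Gaussian, beta=1 a Laplacian
z = sample_generalized_gaussian(mu, alpha, beta)
print(z.shape)  # torch.Size([4, 8])
```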
{"title":"A Semi-supervised Generalized VAE Framework for Abnormality Detection using One-Class Classification","authors":"Renuka Sharma, Satvik Mashkaria, Suyash P. Awate","doi":"10.1109/WACV51458.2022.00137","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00137","url":null,"abstract":"Abnormality detection is a one-class classification (OCC) problem where the methods learn either a generative model of the inlier class (e.g., in the variants of kernel principal component analysis) or a decision boundary to encapsulate the inlier class (e.g., in the one-class variants of the support vector machine). Learning schemes for OCC typically train on data solely from the inlier class, but some recent OCC methods have proposed semi-supervised extensions that also leverage a small amount of training data from outlier classes. Other recent methods extend existing principles to employ deep neural network (DNN) models for learning (for the inlier class) either latent-space distributions or autoencoders, but not both. We propose a semi-supervised variational formulation, leveraging generalized-Gaussian (GG) models leading to data-adaptive, robust, and uncertainty-aware distribution modeling in both latent space and image space. We propose a reparameterization for sampling from the latent-space GG to enable backpropagation-based optimization. Results on many publicly available real-world image sets and a synthetic image set show the benefits of our method over existing methods.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134415905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}