
2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): Latest Publications

GraDual: Graph-based Dual-modal Representation for Image-Text Matching
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00252
Siqu Long, S. Han, Xiaojun Wan, Josiah Poon
Image-text retrieval is a challenging task that aims to measure the visual-semantic correspondence between an image and a text caption. It is difficult mainly because an image lacks the semantic context information present in its corresponding caption, while a text representation is too limited to fully describe the details of an image. In this paper, we introduce Graph-based Dual-modal Representations (GraDual), including Vision-Integrated Text Embedding (VITE) and Context-Integrated Visual Embedding (CIVE), for image-text retrieval. GraDual improves the coverage of each modality by exploiting textual context semantics for the image representation and using visual features as guidance for the text representation. Specifically, we design: 1) a dual-modal graph representation mechanism to solve the lack-of-coverage issue for each modality; 2) an intermediate graph embedding integration strategy to enhance important patterns across the other modality's global features; and 3) a dual-modal driven cross-modal matching network to generate a filtered representation of the other modality. Extensive experiments on two benchmark datasets, MS-COCO and Flickr30K, demonstrate the superiority of the proposed GraDual over state-of-the-art methods.
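To make the graph-based representation idea concrete, here is a minimal sketch (not the authors' implementation; the single graph-convolution step, layer shapes, and mean-pooling readout are all illustrative assumptions) of embedding each modality as a graph and scoring an image-text pair by the similarity of the pooled graph embeddings:

```python
# Toy cross-modal matching with one GCN step per modality; all names and
# shapes are hypothetical, chosen only to illustrate the dual-graph idea.
import torch
import torch.nn.functional as F

def gcn_layer(x, adj, weight):
    """One graph-convolution step: mean-aggregate neighbors, then project."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)   # node degrees
    return F.relu((adj @ x) / deg @ weight)

def graph_embed(node_feats, adj, weight):
    """Embed a modality graph into a single global vector by mean pooling."""
    h = gcn_layer(node_feats, adj, weight)
    return F.normalize(h.mean(dim=0), dim=0)

# Toy example: 5 image-region nodes and 7 caption-token nodes, 64-d features.
torch.manual_seed(0)
w_img, w_txt = torch.randn(64, 128), torch.randn(64, 128)
img_nodes, img_adj = torch.randn(5, 64), torch.ones(5, 5)
txt_nodes, txt_adj = torch.randn(7, 64), torch.ones(7, 7)
score = graph_embed(img_nodes, img_adj, w_img) @ graph_embed(txt_nodes, txt_adj, w_txt)
print(float(score))  # visual-semantic similarity for this image-caption pair
```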
Citations: 21
Consistent Cell Tracking in Multi-frames with Spatio-Temporal Context by Object-Level Warping Loss
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00182
Junya Hayashida, Kazuya Nishimura, Ryoma Bise
Multi-object tracking is essential in biomedical image analysis. Most methods follow a tracking-by-detection approach that uses object detectors and learns appearance feature models of the detected regions for association. Although these methods can learn appearance similarity features to identify the same objects across frames, they have difficulty identifying the same cells because cells have similar appearances and their shapes change as they migrate. In addition, cells often partially overlap for several frames. In such cases, even an expert biologist would require knowledge of the spatio-temporal context in order to identify individual cells. To tackle these difficult situations, we propose a cell-tracking method that can effectively use the spatio-temporal context in multiple frames by using long-term motion estimation and an object-level warping loss. Our experiments show that the proposed method outperformed state-of-the-art methods under various conditions on real biological images.
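As a rough illustration of the object-level warping loss ingredient (an assumption-laden toy, not the paper's formulation), one can warp each cell centroid at frame t by an estimated motion vector and penalize its distance to the matched centroid at a later frame:

```python
# Toy object-level warping loss over matched cell centroids; the per-object
# motion vectors stand in for the paper's long-term motion estimation.
import torch

def object_warping_loss(centroids_t, motion, centroids_tk):
    """centroids_t, centroids_tk: (N, 2) matched cell centers; motion: (N, 2)."""
    warped = centroids_t + motion                        # apply estimated motion
    return torch.mean(torch.norm(warped - centroids_tk, dim=1))

cells_t  = torch.tensor([[10.0, 12.0], [40.0, 8.0]])
flow     = torch.tensor([[ 2.0,  1.0], [-1.5, 0.5]])    # predicted per-cell motion
cells_tk = torch.tensor([[12.5, 13.0], [38.0, 9.0]])
print(object_warping_loss(cells_t, flow, cells_tk))
```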
Citations: 3
RLSS: A Deep Reinforcement Learning Algorithm for Sequential Scene Generation
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00278
Azimkhon Ostonov, Peter Wonka, D. Michels
We present RLSS: a reinforcement learning algorithm for sequential scene generation. It is based on employing the proximal policy optimization (PPO) algorithm for generative problems. In particular, we consider how to effectively reduce the action space by including a greedy search algorithm in the learning process. The approach places objects iteratively in the virtual scene: in each step, the network chooses which objects to place and selects positions that result in maximal reward. A high reward is assigned if the last action produced the desired properties, whereas the violation of constraints is penalized. Our experiments demonstrate that our method converges for a relatively large number of actions and learns to generate scenes with predefined design objectives. We demonstrate the capability of our method to generate plausible and diverse scenes efficiently by solving indoor planning problems and generating Angry Birds levels.
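RLSS builds on PPO; the clipped surrogate objective at its core can be sketched as follows (standard PPO, not the authors' full training loop; the tensors are illustrative):

```python
# Standard PPO clipped surrogate loss; `adv` is the advantage estimate.
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    return -torch.mean(torch.min(unclipped, clipped))    # maximize the surrogate

logp_old = torch.tensor([-1.2, -0.7, -2.0])              # old policy log-probs
logp_new = torch.tensor([-1.0, -0.9, -1.5])              # current policy log-probs
adv      = torch.tensor([ 0.5, -0.3,  1.2])              # advantage estimates
print(ppo_clip_loss(logp_new, logp_old, adv))
```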
Citations: 3
Registration of Human Point Set using Automatic Key Point Detection and Region-aware Features
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00232
A. Maharjan, Xiaohui Yuan
Non-rigid point set registration is challenging when point sets have large deformations and different numbers of points. Examples of such point sets include human point sets representing complex human poses captured by different types of depth cameras. In this work, we present a probabilistic, non-rigid registration method to deal with these issues. Two regularization terms are used: key point correspondences and local neighborhood preservation. Our method detects key points in the point sets based on geodesic distance. Correspondences are established using a new cluster-based, region-aware feature descriptor that encodes the association of a cluster with the left-right (symmetry) or upper-lower regions of the point sets. We use a Stochastic Neighbor Embedding (SNE) constraint to preserve the local neighborhood of the point set. Experimental results on challenging 3D human poses demonstrate that our method outperforms state-of-the-art methods, achieving highly competitive performance with only a slight (3.9%) increase in error compared with a method that uses manually specified key point correspondences.
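Key point detection based on geodesic distance can be sketched as follows (an illustrative assumption, not the authors' code): build a k-nearest-neighbor graph over the points and run Dijkstra, so geodesic extremities (e.g., hands and feet in a human point set) emerge as key point candidates:

```python
# Geodesic distances on a point set via a k-NN graph plus Dijkstra.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial import cKDTree

def geodesic_from(points, source, k=6):
    tree = cKDTree(points)
    dist, idx = tree.query(points, k=k + 1)              # each point: self + k neighbors
    rows = np.repeat(np.arange(len(points)), k)
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())),
                       shape=(len(points), len(points)))
    return dijkstra(graph, directed=False, indices=source)

pts = np.random.rand(200, 3)                             # toy 3D human point set
d = geodesic_from(pts, source=0)
print(int(d.argmax()))  # geodesically farthest point: a key point candidate
```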
Citations: 3
Self-Supervised Shape Alignment for Sports Field Registration
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00382
F. Shi, P. Marchwica, J. A. G. Higuera, Michael Jamieson, Mehrsan Javan, P. Siva
This paper presents an end-to-end self-supervised learning approach for cross-modality image registration and homography estimation, with a particular emphasis on registering sports field templates onto broadcast videos as a practical application. Rather than using any pairwise labelled data for training, we propose a self-supervised data mining method to train the registration network with a natural image and its edge map. Using an iterative estimation process controlled by a score regression network (SRN) that measures the registration error, the network can learn to estimate any homography transformation regardless of how misaligned the image and the template are. We further show the benefits of using pretrained weights to finetune the network for sports field calibration with little training data. We demonstrate the effectiveness of our proposed method by applying it to real-world sports broadcast videos, where we achieve state-of-the-art results with real-time processing.
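The iterative estimation loop can be caricatured as follows (a minimal sketch with placeholder networks; the real registration and SRN architectures, and the re-warping of the template with the current estimate, are not reproduced here):

```python
import torch
import torch.nn as nn

# Placeholder networks (assumptions): the paper uses dedicated registration
# and score-regression architectures over the frame and the warped template.
reg_net = nn.Sequential(nn.Flatten(), nn.Linear(2 * 64 * 64, 8))  # proposes an H update
srn     = nn.Sequential(nn.Flatten(), nn.Linear(2 * 64 * 64, 1))  # regresses an error score

def to_homography(p):
    """Map an 8-vector of small offsets to a 3x3 homography near identity."""
    return torch.eye(3) + torch.cat([p, p.new_zeros(1)]).view(3, 3)

H = torch.eye(3)                                  # current homography estimate
x = torch.randn(1, 2, 64, 64)                     # stacked frame + warped template
for _ in range(3):                                # a few refinement iterations
    delta = to_homography(0.01 * reg_net(x)[0])   # predicted correction
    H = delta @ H                                 # compose it into the estimate
    err = srn(x)                                  # SRN's estimate of remaining error
    # (re-warping the template with the new H before the next pass is omitted)
print(H, float(err))
```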
Citations: 8
Re-Compose the Image by Evaluating the Crop on More Than Just a Score
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00056
Yang Cheng, Qian Lin, J. Allebach
Image re-composition has always been regarded as one of the most important steps in the post-processing of a photo. The quality of a re-composition mainly depends on a person's aesthetic taste, which makes it far from effortless for those without much experience in photography. Besides, while re-composing one image does not take much of a person's time, the process becomes quite time-consuming when there are hundreds of images to re-compose. To solve these problems, we propose a method that automates the process of re-composing an image to a desired aspect ratio. Although many image re-composition methods already exist, they only provide a score for their predicted best crop and fail to explain why the score is high or low. In contrast, we design an explainable method by introducing a novel 10-layer aesthetic score map, which represents how the position of the saliency in the original uncropped image, relative to that of the crop region, contributes to the overall score of the crop, so that the crop is not represented by just a single score. Our experiments show that the proposed score map boosts the performance of our algorithm, which achieves state-of-the-art performance on both public datasets and our own.
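A toy stand-in for the core intuition (not the authors' 10-layer score map) is to score a candidate crop by how much of the original image's saliency it retains, which already makes the score explainable in terms of saliency position:

```python
# Saliency-relative crop scoring: fraction of saliency mass kept by a crop.
import numpy as np

def saliency_coverage(saliency, crop):
    """saliency: (H, W) non-negative map; crop: (y0, x0, y1, x1) in pixels."""
    y0, x0, y1, x1 = crop
    inside = saliency[y0:y1, x0:x1].sum()
    return inside / max(saliency.sum(), 1e-8)    # fraction of saliency kept

sal = np.zeros((100, 150))
sal[30:70, 50:110] = 1.0                         # synthetic salient region
print(saliency_coverage(sal, (20, 40, 80, 120)))  # 1.0: this crop keeps all saliency
```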
Citations: 2
Cleaning Noisy Labels by Negative Ensemble Learning for Source-Free Unsupervised Domain Adaptation
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00043
Waqar Ahmed, Pietro Morerio, Vittorio Murino
Conventional Unsupervised Domain Adaptation (UDA) methods presume that source and target domain data are simultaneously available during training. Such an assumption may not hold in practice, as source data is often inaccessible (e.g., for privacy reasons). Instead, a pre-trained source model is usually available, but it performs poorly on the target domain due to the well-known domain shift problem. This translates into a significant number of misclassifications, which can be interpreted as structured noise affecting the inferred target pseudo-labels. In this work, we cast UDA as a pseudo-label refinery problem in the challenging source-free scenario. We propose the Negative Ensemble Learning (NEL) technique, a unified method for adaptive noise filtering and progressive pseudo-label refinement. NEL tackles noisy pseudo-labels by enhancing diversity among ensemble members with different stochastic (i) input augmentation and (ii) feedback. The latter is achieved by leveraging the novel concept of Disjoint Residual Labels, which allow propagating diverse information to the different members. Eventually, a single model is trained with the refined pseudo-labels, which leads to robust performance on the target domain. Extensive experiments show that the proposed method achieves state-of-the-art performance on major UDA benchmarks, such as Digit5, PACS, Visda-C, and DomainNet, without using source data samples at all.
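The negative-learning ingredient that NEL builds on can be sketched as follows (the standard complementary-label loss, shown for illustration rather than taken from the authors' code): instead of pushing the model toward a possibly noisy pseudo-label, push it away from a complementary label k ("the input is NOT class k") via -log(1 - p_k):

```python
# Standard negative-learning loss with randomly drawn complementary labels.
import torch
import torch.nn.functional as F

def negative_learning_loss(logits, complementary_labels):
    probs = F.softmax(logits, dim=1)
    p_k = probs.gather(1, complementary_labels.view(-1, 1)).squeeze(1)
    return -torch.log(1.0 - p_k + 1e-8).mean()   # penalize confidence in class k

logits = torch.randn(4, 10)                      # batch of 4, 10 classes
comp = torch.randint(0, 10, (4,))                # "not this class" labels
print(negative_learning_loss(logits, comp))
```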
Citations: 9
AirCamRTM: Enhancing Vehicle Detection for Efficient Aerial Camera-based Road Traffic Monitoring
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00349
R. Makrigiorgis, Nicolas Hadjittoouli, C. Kyrkou, T. Theocharides
Efficient road traffic monitoring plays a fundamental role in successfully resolving traffic congestion in cities. Unmanned Aerial Vehicles (UAVs), or drones equipped with cameras, are an attractive proposition for providing flexible and infrastructure-free traffic monitoring. However, real-time traffic monitoring from UAV imagery poses several challenges due to the large image sizes and the presence of non-relevant targets. In this paper, we propose the AirCamRTM framework, which combines road segmentation and vehicle detection to focus only on relevant vehicles; as a result, it improves monitoring performance by ~2× and provides a ~18% accuracy improvement. Furthermore, through a real experimental setup, we qualitatively evaluate the performance of the proposed approach and demonstrate how it can be used for real-time traffic monitoring with UAVs.
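The road-segmentation-guided filtering step can be sketched as follows (illustrative; the detector boxes and segmentation mask are assumed inputs): keep only detections whose box center falls on the predicted road mask, so downstream counting ignores irrelevant targets such as vehicles parked off the road:

```python
# Filter vehicle detections by a binary road-segmentation mask.
import numpy as np

def filter_by_road(boxes, road_mask):
    """boxes: (N, 4) as (x0, y0, x1, y1); road_mask: (H, W) boolean."""
    cx = ((boxes[:, 0] + boxes[:, 2]) // 2).astype(int)  # box center x
    cy = ((boxes[:, 1] + boxes[:, 3]) // 2).astype(int)  # box center y
    return boxes[road_mask[cy, cx]]

mask = np.zeros((100, 100), dtype=bool)
mask[40:60, :] = True                  # horizontal road band
dets = np.array([[10, 45, 20, 55],     # on the road -> kept
                 [10,  5, 20, 15]])    # off the road -> dropped
print(filter_by_road(dets, mask))
```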
Citations: 3
QUALIFIER: Question-Guided Self-Attentive Multimodal Fusion Network for Audio Visual Scene-Aware Dialog
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00256
Muchao Ye, Quanzeng You, Fenglong Ma
Audio-visual scene-aware dialog (AVSD) is a new and more challenging visual question answering (VQA) task because of the higher complexity of feature extraction and fusion brought by the additional modalities. Although recent methods have achieved early success in improving feature extraction for AVSD, feature fusion still needs further investigation. In this paper, inspired by the success of the self-attention mechanism and the importance of question understanding for VQA, we propose a question-guided self-attentive multi-modal fusion network (QUALIFIER) to improve AVSD practice in the stages of feature fusion and answer generation. Specifically, after extracting features and learning a comprehensive feature for each modality, we first use the designed self-attentive multi-modal fusion (SMF) module to aggregate each feature with the correlated information learned from the others. Then, by prioritizing the question feature, we concatenate it with each fused feature to guide the generation of a natural-language response to the question. In experiments, QUALIFIER shows better performance than baseline methods on the large-scale AVSD dataset DSTC7. Additionally, human evaluation and ablation study results demonstrate the effectiveness of our network architecture.
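A minimal sketch of the two ingredients, with placeholder dimensions and modules (this is not the QUALIFIER architecture): one modality attends to another via multi-head attention, and the pooled result is concatenated with the question feature:

```python
# Attention-based cross-modal fusion followed by question-guided concatenation.
import torch
import torch.nn as nn

d = 128
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

video    = torch.randn(1, 20, d)               # 20 video-frame features
audio    = torch.randn(1, 20, d)               # 20 audio features
question = torch.randn(1, d)                   # encoded question

fused_video, _ = attn(video, audio, audio)     # video attends to audio context
pooled = fused_video.mean(dim=1)               # (1, d) global fused feature
guided = torch.cat([question, pooled], dim=1)  # question-guided representation
print(guided.shape)                            # torch.Size([1, 256])
```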
Citations: 3
Fast Nonlinear Image Unblending
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00325
Daichi Horita, K. Aizawa, Ryohei Suzuki, Taizan Yonetsuji, Huachun Zhu
Nonlinear color blending, i.e., advanced blending specified by blend modes such as "overlay" and "multiply", is extensively employed by digital creators to produce attractive visual effects. To enjoy such flexible editing modalities on existing bitmap images like photographs, however, creators need a fast nonlinear blending algorithm that decomposes an image into a set of semi-transparent layers. To address this issue, we propose a neural-network-based method that nonlinearly decomposes an input image into linear and nonlinear alpha layers, which can be modified separately for editing purposes, based on the specified color palettes and blend modes. Experiments show that our proposed method achieves an inference speed 370 times faster than the state-of-the-art method for nonlinear image unblending, which relies on computationally intensive iterative optimization. Furthermore, our reconstruction quality is higher than or comparable to that of other methods, including linear blending models. In addition, we provide examples that apply our method to image editing with nonlinear blend modes.
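For reference, the two nonlinear blend modes named above have standard compositing formulas (values in [0, 1]); a minimal sketch of compositing one semi-transparent layer with them:

```python
# Standard "multiply" and "overlay" blend modes plus alpha compositing.
import numpy as np

def multiply(base, layer):
    return base * layer

def overlay(base, layer):
    # Darkens where the base is dark, lightens where it is bright.
    return np.where(base < 0.5,
                    2 * base * layer,
                    1 - 2 * (1 - base) * (1 - layer))

def blend_with_alpha(base, layer, alpha, mode):
    """Composite one semi-transparent layer onto the base with a blend mode."""
    return (1 - alpha) * base + alpha * mode(base, layer)

base, layer = np.full((2, 2), 0.4), np.full((2, 2), 0.8)
print(blend_with_alpha(base, layer, alpha=0.5, mode=overlay))
```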
Citations: 2