HiCo: Hierarchical Contrastive Learning for Ultrasound Video Model Pretraining
Pub Date: 2022-10-10, DOI: 10.48550/arXiv.2210.04477
Chunhui Zhang, Yixiong Chen, Li Liu, Qiong Liu, Xiaoping Zhou
Self-supervised ultrasound (US) video model pretraining can achieve some of the most promising results on US diagnosis using only a small amount of labeled data. However, it does not take full advantage of multi-level knowledge for learning deep neural networks (DNNs), and thus struggles to learn transferable feature representations. This work proposes a hierarchical contrastive learning (HiCo) method to improve the transferability of US video model pretraining. HiCo introduces both peer-level semantic alignment and cross-level semantic alignment to facilitate interaction between different semantic levels, which effectively accelerates convergence and leads to better generalization and adaptation of the learned model. Additionally, a softened objective function is implemented by smoothing the hard labels, which alleviates the negative effect of local similarities between images of different classes. Experiments with HiCo on five datasets demonstrate favorable results over state-of-the-art approaches. The source code of this work is publicly available at https://github.com/983632847/HiCo.
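As a rough illustration of the softened objective described above, the sketch below applies label smoothing to a classification-style pretraining loss in PyTorch; the smoothing value and the plain cross-entropy head are assumptions for illustration, not HiCo's exact formulation.

```python
import torch
import torch.nn.functional as F

def softened_cross_entropy(logits, targets, smoothing=0.1):
    """Cross-entropy with smoothed (softened) hard labels.

    Instead of a one-hot target, every class receives smoothing / num_classes
    probability mass and the true class keeps the remainder. The smoothing
    value is an illustrative assumption.
    """
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Build soft targets: uniform floor plus the remaining mass on the true class.
    soft_targets = torch.full_like(log_probs, smoothing / num_classes)
    soft_targets.scatter_(-1, targets.unsqueeze(-1),
                          1.0 - smoothing + smoothing / num_classes)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Example: 8 video clips, 5 pseudo-classes.
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
loss = softened_cross_entropy(logits, targets)
```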
{"title":"HiCo: Hierarchical Contrastive Learning for Ultrasound Video Model Pretraining","authors":"Chunhui Zhang, Yixiong Chen, Li Liu, Qiong Liu, Xiaoping Zhou","doi":"10.48550/arXiv.2210.04477","DOIUrl":"https://doi.org/10.48550/arXiv.2210.04477","url":null,"abstract":"The self-supervised ultrasound (US) video model pretraining can use a small amount of labeled data to achieve one of the most promising results on US diagnosis. However, it does not take full advantage of multi-level knowledge for learning deep neural networks (DNNs), and thus is difficult to learn transferable feature representations. This work proposes a hierarchical contrastive learning (HiCo) method to improve the transferability for the US video model pretraining. HiCo introduces both peer-level semantic alignment and cross-level semantic alignment to facilitate the interaction between different semantic levels, which can effectively accelerate the convergence speed, leading to better generalization and adaptation of the learned model. Additionally, a softened objective function is implemented by smoothing the hard labels, which can alleviate the negative effect caused by local similarities of images between different classes. Experiments with HiCo on five datasets demonstrate its favorable results over state-of-the-art approaches. The source code of this work is publicly available at https://github.com/983632847/HiCo.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81227823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Eyecandies Dataset for Unsupervised Multimodal Anomaly Detection and Localization
Pub Date: 2022-10-10, DOI: 10.48550/arXiv.2210.04570
L. Bonfiglioli, M. Toschi, Davide Silvestri, Nicola Fioraio, Daniele De Gregorio
We present Eyecandies, a novel synthetic dataset for unsupervised anomaly detection and localization. Photo-realistic images of procedurally generated candies are rendered in a controlled environment under multiple lighting conditions, together with depth and normal maps, in an industrial conveyor scenario. We make anomaly-free samples available for model training and validation, while anomalous instances with precise ground-truth annotations are provided only in the test set. The dataset comprises ten classes of candies, each posing different challenges, such as complex textures, self-occlusions and specularities. Furthermore, we achieve large intra-class variation by randomly drawing key parameters of a procedural rendering pipeline, which enables the creation of an arbitrary number of instances with photo-realistic appearance. Likewise, anomalies are injected into the rendering graph and pixel-wise annotations are generated automatically, overcoming human biases and possible inconsistencies. We believe this dataset may encourage the exploration of original approaches to the anomaly detection task, e.g. by combining color, depth and normal maps, as these are not provided by most existing datasets. Indeed, to demonstrate how exploiting additional information can lead to higher detection performance, we show the results obtained by training a deep convolutional autoencoder to reconstruct different combinations of inputs.
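To illustrate the baseline mentioned in the last sentence, a minimal convolutional autoencoder over concatenated color, depth and normal channels might look like the sketch below; the channel counts, layer sizes and the per-pixel error score are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultimodalAE(nn.Module):
    """Toy convolutional autoencoder over stacked RGB (3), depth (1) and
    normal (3) channels; all layer sizes are illustrative assumptions."""
    def __init__(self, in_ch=7, latent=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(latent, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, in_ch, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

# Anomaly score: per-pixel reconstruction error on a test sample.
model = MultimodalAE()
x = torch.rand(1, 7, 256, 256)                 # RGB + depth + normals stacked
recon = model(x)
anomaly_map = (x - recon).pow(2).mean(dim=1)   # (1, 256, 256)
```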
{"title":"The Eyecandies Dataset for Unsupervised Multimodal Anomaly Detection and Localization","authors":"L. Bonfiglioli, M. Toschi, Davide Silvestri, Nicola Fioraio, Daniele De Gregorio","doi":"10.48550/arXiv.2210.04570","DOIUrl":"https://doi.org/10.48550/arXiv.2210.04570","url":null,"abstract":"We present Eyecandies, a novel synthetic dataset for unsupervised anomaly detection and localization. Photo-realistic images of procedurally generated candies are rendered in a controlled environment under multiple lightning conditions, also providing depth and normal maps in an industrial conveyor scenario. We make available anomaly-free samples for model training and validation, while anomalous instances with precise ground-truth annotations are provided only in the test set. The dataset comprises ten classes of candies, each showing different challenges, such as complex textures, self-occlusions and specularities. Furthermore, we achieve large intra-class variation by randomly drawing key parameters of a procedural rendering pipeline, which enables the creation of an arbitrary number of instances with photo-realistic appearance. Likewise, anomalies are injected into the rendering graph and pixel-wise annotations are automatically generated, overcoming human-biases and possible inconsistencies. We believe this dataset may encourage the exploration of original approaches to solve the anomaly detection task, e.g. by combining color, depth and normal maps, as they are not provided by most of the existing datasets. Indeed, in order to demonstrate how exploiting additional information may actually lead to higher detection performance, we show the results obtained by training a deep convolutional autoencoder to reconstruct different combinations of inputs.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85911542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
Pub Date: 2022-10-09, DOI: 10.48550/arXiv.2210.04341
A. Fragomeni, Michael Wray, D. Damen
In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. the surrounding video segments) can be used to improve retrieval performance. We propose Context Transformer (ConTra), an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for both the video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analyses show the efficacy of the proposed method.
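The contrastive supervision mentioned above can be sketched as a standard symmetric InfoNCE loss between clip and sentence embeddings; the temperature and normalisation choices below are generic assumptions rather than ConTra's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched clip/sentence embeddings.
    Each clip's positive is its own sentence; all other pairs are negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)   # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_nce(torch.randn(16, 512), torch.randn(16, 512))
```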
{"title":"ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval","authors":"A. Fragomeni, Michael Wray, D. Damen","doi":"10.48550/arXiv.2210.04341","DOIUrl":"https://doi.org/10.48550/arXiv.2210.04341","url":null,"abstract":"In this paper, we re-examine the task of cross-modal clip-sentence retrieval, where the clip is part of a longer untrimmed video. When the clip is short or visually ambiguous, knowledge of its local temporal context (i.e. surrounding video segments) can be used to improve the retrieval performance. We propose Context Transformer (ConTra); an encoder architecture that models the interaction between a video clip and its local temporal context in order to enhance its embedded representations. Importantly, we supervise the context transformer using contrastive losses in the cross-modal embedding space. We explore context transformers for video and text modalities. Results consistently demonstrate improved performance on three datasets: YouCook2, EPIC-KITCHENS and a clip-sentence version of ActivityNet Captions. Exhaustive ablation studies and context analysis show the efficacy of the proposed method.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83359915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Differentiable Distance Approximation for Fairer Image Classification
Pub Date: 2022-10-09, DOI: 10.48550/arXiv.2210.04369
Nicholas Rosa, T. Drummond, Mehrtash Harandi
Naively trained AI models can be heavily biased. This is particularly problematic when the biases involve legally or morally protected attributes such as ethnic background, age or gender. Existing solutions to this problem come at the cost of extra computation or unstable adversarial optimisation, or impose losses on the feature-space structure that are disconnected from fairness measures and generalise to fairness only loosely. In this work we propose a differentiable approximation of the variance of demographics, a metric that can be used to measure the bias, or unfairness, in an AI model. Our approximation can be optimised alongside the regular training objective, which eliminates the need for any extra models during training and directly improves the fairness of the regularised models. We demonstrate that our approach improves the fairness of AI models across varied task and dataset scenarios, whilst still maintaining a high level of classification accuracy. Code is available at https://bitbucket.org/nelliottrosa/base_fairness.
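A minimal sketch of a "variance of demographics" style regulariser is given below, using the softmax probability of the true class as a differentiable accuracy surrogate; both the surrogate and the way it would be weighted against the task loss are assumptions, not the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def demographic_variance(logits, labels, groups, num_groups):
    """Variance across demographic groups of a differentiable accuracy
    surrogate (softmax probability of the true class). Lower variance means
    the model performs more evenly across groups."""
    probs = F.softmax(logits, dim=-1)
    correct_prob = probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # (N,)
    group_means = []
    for g in range(num_groups):
        mask = groups == g
        if mask.any():
            group_means.append(correct_prob[mask].mean())
    return torch.stack(group_means).var(unbiased=False)

# Example usage; in training this would be added to the task loss,
# e.g. total_loss = task_loss + lambda_fair * demographic_variance(...).
logits = torch.randn(32, 10, requires_grad=True)
labels = torch.randint(0, 10, (32,))
groups = torch.randint(0, 4, (32,))
reg = demographic_variance(logits, labels, groups, num_groups=4)
```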
{"title":"A Differentiable Distance Approximation for Fairer Image Classification","authors":"Nicholas Rosa, T. Drummond, Mehrtash Harandi","doi":"10.48550/arXiv.2210.04369","DOIUrl":"https://doi.org/10.48550/arXiv.2210.04369","url":null,"abstract":"Naively trained AI models can be heavily biased. This can be particularly problematic when the biases involve legally or morally protected attributes such as ethnic background, age or gender. Existing solutions to this problem come at the cost of extra computation, unstable adversarial optimisation or have losses on the feature space structure that are disconnected from fairness measures and only loosely generalise to fairness. In this work we propose a differentiable approximation of the variance of demographics, a metric that can be used to measure the bias, or unfairness, in an AI model. Our approximation can be optimised alongside the regular training objective which eliminates the need for any extra models during training and directly improves the fairness of the regularised models. We demonstrate that our approach improves the fairness of AI models in varied task and dataset scenarios, whilst still maintaining a high level of classification accuracy. Code is available at https://bitbucket.org/nelliottrosa/base_fairness.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73251253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Point Cloud Upsampling via Cascaded Refinement Network
Pub Date: 2022-10-08, DOI: 10.48550/arXiv.2210.03942
Hang Du, Xuejun Yan, Jingjing Wang, Di Xie, Shiliang Pu
Point cloud upsampling aims to generate a dense, uniform point set that lies close to the underlying surface. Most previous approaches pursue these objectives by carefully designing a single-stage network, which makes it challenging to generate a high-fidelity point distribution. Upsampling a point cloud in a coarse-to-fine manner is a sensible alternative; however, existing coarse-to-fine upsampling methods require extra training strategies, which are complicated and time-consuming during training. In this paper, we propose a simple yet effective cascaded refinement network, consisting of three generation stages that share the same network architecture but pursue different objectives. Specifically, the first two upsampling stages generate dense but coarse points progressively, while the last refinement stage further adjusts the coarse points to better positions. To mitigate learning conflicts between stages and reduce the difficulty of regressing new points, we encourage each stage to predict point offsets with respect to the input shape. In this manner, the proposed cascaded refinement network can be easily optimised without extra learning strategies. Moreover, we design a transformer-based feature extraction module to learn informative global and local shape context. At inference time, we can dynamically adjust the model's efficiency and effectiveness, depending on the available computational resources. Extensive experiments on both synthetic and real-scanned datasets demonstrate that the proposed approach outperforms existing state-of-the-art methods.
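The per-stage offset prediction can be sketched as below, where each stage duplicates its input points and regresses offsets relative to them, with a shared MLP standing in for the paper's transformer-based extractor; the upsampling ratios and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OffsetStage(nn.Module):
    """One generation stage: duplicates points by the stage's ratio and
    predicts per-point offsets added back to the duplicated coordinates."""
    def __init__(self, ratio=2):
        super().__init__()
        self.ratio = ratio
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 3, 1),
        )

    def forward(self, xyz):                              # xyz: (B, 3, N)
        up = xyz.repeat_interleave(self.ratio, dim=2)    # (B, 3, N * ratio)
        return up + self.mlp(up)                         # coarse points + offsets

# Cascade: two x2 upsampling stages followed by a refinement stage (ratio 1).
stages = nn.Sequential(OffsetStage(2), OffsetStage(2), OffsetStage(1))
dense = stages(torch.rand(4, 3, 256))                    # (4, 3, 1024)
```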
{"title":"Point Cloud Upsampling via Cascaded Refinement Network","authors":"Hang Du, Xuejun Yan, Jingjing Wang, Di Xie, Shiliang Pu","doi":"10.48550/arXiv.2210.03942","DOIUrl":"https://doi.org/10.48550/arXiv.2210.03942","url":null,"abstract":"Point cloud upsampling focuses on generating a dense, uniform and proximity-to-surface point set. Most previous approaches accomplish these objectives by carefully designing a single-stage network, which makes it still challenging to generate a high-fidelity point distribution. Instead, upsampling point cloud in a coarse-to-fine manner is a decent solution. However, existing coarse-to-fine upsampling methods require extra training strategies, which are complicated and time-consuming during the training. In this paper, we propose a simple yet effective cascaded refinement network, consisting of three generation stages that have the same network architecture but achieve different objectives. Specifically, the first two upsampling stages generate the dense but coarse points progressively, while the last refinement stage further adjust the coarse points to a better position. To mitigate the learning conflicts between multiple stages and decrease the difficulty of regressing new points, we encourage each stage to predict the point offsets with respect to the input shape. In this manner, the proposed cascaded refinement network can be easily optimized without extra learning strategies. Moreover, we design a transformer-based feature extraction module to learn the informative global and local shape context. In inference phase, we can dynamically adjust the model efficiency and effectiveness, depending on the available computational resources. Extensive experiments on both synthetic and real-scanned datasets demonstrate that the proposed approach outperforms the existing state-of-the-art methods.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76194086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Scale Wavelet Transformer for Face Forgery Detection
Pub Date: 2022-10-08, DOI: 10.48550/arXiv.2210.03899
Jie Liu, Jingjing Wang, Peng Zhang, Chunmao Wang, Di Xie, Shiliang Pu
Many current face forgery detection methods aggregate spatial and frequency features to enhance generalization ability and achieve promising performance in the cross-dataset scenario. However, these methods only leverage single-level frequency information, which limits their expressive ability. To overcome this limitation, we propose a multi-scale wavelet transformer framework for face forgery detection. Specifically, to take full advantage of the multi-scale and multi-frequency wavelet representation, we gradually aggregate the multi-scale wavelet representation at different stages of the backbone network. To better fuse frequency features with spatial features, frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces. Meanwhile, cross-modality attention is proposed to fuse the frequency features with the spatial features. These two attention modules are computed through a unified transformer block for efficiency. A wide variety of experiments demonstrate that the proposed method is efficient and effective in both within-dataset and cross-dataset settings.
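As a rough sketch of combining wavelet sub-bands with spatial features, the code below uses a one-level Haar transform and a frequency-based spatial attention gate; the Haar decomposition and the single-conv gating layer are assumptions for illustration, not the paper's module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_dwt(x):
    """One-level Haar wavelet transform returning (LL, LH, HL, HH) sub-bands."""
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    return (a + b + c + d) / 2, (a - b + c - d) / 2, \
           (a + b - c - d) / 2, (a - b - c + d) / 2

class FrequencySpatialAttention(nn.Module):
    """High-frequency bands are squeezed into a single-channel map that gates
    the spatial features; layer sizes are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(3 * channels, 1, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, spatial_feat, lh, hl, hh):
        attn = self.gate(torch.cat([lh, hl, hh], dim=1))   # (B, 1, H/2, W/2)
        attn = F.interpolate(attn, size=spatial_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        return spatial_feat * attn

x = torch.rand(2, 16, 64, 64)
ll, lh, hl, hh = haar_dwt(x)
out = FrequencySpatialAttention(16)(x, lh, hl, hh)
```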
{"title":"Multi-Scale Wavelet Transformer for Face Forgery Detection","authors":"Jie Liu, Jingjing Wang, Peng Zhang, Chunmao Wang, Di Xie, Shiliang Pu","doi":"10.48550/arXiv.2210.03899","DOIUrl":"https://doi.org/10.48550/arXiv.2210.03899","url":null,"abstract":"Currently, many face forgery detection methods aggregate spatial and frequency features to enhance the generalization ability and gain promising performance under the cross-dataset scenario. However, these methods only leverage one level frequency information which limits their expressive ability. To overcome these limitations, we propose a multi-scale wavelet transformer framework for face forgery detection. Specifically, to take full advantage of the multi-scale and multi-frequency wavelet representation, we gradually aggregate the multi-scale wavelet representation at different stages of the backbone network. To better fuse the frequency feature with the spatial features, frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces. Meanwhile, cross-modality attention is proposed to fuse the frequency features with the spatial features. These two attention modules are calculated through a unified transformer block for efficiency. A wide variety of experiments demonstrate that the proposed method is efficient and effective for both within and cross datasets.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76415461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PS-ARM: An End-to-End Attention-aware Relation Mixer Network for Person Search
Pub Date: 2022-10-07, DOI: 10.48550/arXiv.2210.03433
M. Fiaz, Hisham Cholakkal, Sanath Narayan, R. Anwer, F. Khan
Person search is a challenging problem with various real-world applications that aims at joint person detection and re-identification of a query person from uncropped gallery images. Although previous studies focus on rich feature learning, it is still hard to retrieve the query person due to appearance deformations and background distractors. In this paper, we propose a novel attention-aware relation mixer (ARM) module for person search, which exploits the global relations between different local regions within the RoI of a person and makes the representation robust against various appearance deformations and occlusions. The proposed ARM is composed of a relation mixer block and a spatio-channel attention layer. The relation mixer block introduces spatially attended spatial mixing and channel-wise attended channel mixing to effectively capture discriminative relation features within an RoI. These discriminative relation features are further enriched by a spatio-channel attention in which foreground and background discriminability is strengthened in a joint spatio-channel space. Our ARM module is generic and does not rely on fine-grained supervision or topological assumptions, so it can be easily integrated into any Faster R-CNN based person search method. Comprehensive experiments are performed on two challenging benchmark datasets: CUHK-SYSU and PRW. Our PS-ARM achieves state-of-the-art performance on both datasets. On the challenging PRW dataset, PS-ARM achieves an absolute gain of 5 in the mAP score over SeqNet, while operating at a comparable speed.
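A hypothetical relation mixer block over pooled RoI features is sketched below, pairing a spatially gated token-mixing MLP with a channel-gated channel-mixing MLP; the MLP-Mixer-style layout and all dimensions are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class RelationMixer(nn.Module):
    """Sketch of relation mixing over RoI features (B, C, H, W): spatial mixing
    gated by a spatial attention map, then channel mixing gated by channel
    attention. Purely illustrative dimensions and layers."""
    def __init__(self, channels, h, w):
        super().__init__()
        tokens = h * w
        self.spatial_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.spatial_mix = nn.Sequential(nn.Linear(tokens, tokens), nn.GELU(),
                                         nn.Linear(tokens, tokens))
        self.channel_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                          nn.Conv2d(channels, channels, 1),
                                          nn.Sigmoid())
        self.channel_mix = nn.Sequential(nn.Linear(channels, channels), nn.GELU(),
                                         nn.Linear(channels, channels))

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x * self.spatial_gate(x)                          # spatially attended
        x = x + self.spatial_mix(x.flatten(2)).view(b, c, h, w)   # mix positions
        x = x * self.channel_gate(x)                          # channel-wise attended
        t = x.flatten(2).transpose(1, 2)                      # (B, HW, C)
        x = x + self.channel_mix(t).transpose(1, 2).view(b, c, h, w)
        return x

roi_feat = torch.rand(8, 256, 14, 14)                         # pooled RoI features
out = RelationMixer(256, 14, 14)(roi_feat)
```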
{"title":"PS-ARM: An End-to-End Attention-aware Relation Mixer Network for Person Search","authors":"M. Fiaz, Hisham Cholakkal, Sanath Narayan, R. Anwer, F. Khan","doi":"10.48550/arXiv.2210.03433","DOIUrl":"https://doi.org/10.48550/arXiv.2210.03433","url":null,"abstract":"Person search is a challenging problem with various real-world applications, that aims at joint person detection and re-identification of a query person from uncropped gallery images. Although, the previous study focuses on rich feature information learning, it is still hard to retrieve the query person due to the occurrence of appearance deformations and background distractors. In this paper, we propose a novel attention-aware relation mixer (ARM) module for person search, which exploits the global relation between different local regions within RoI of a person and make it robust against various appearance deformations and occlusion. The proposed ARM is composed of a relation mixer block and a spatio-channel attention layer. The relation mixer block introduces a spatially attended spatial mixing and a channel-wise attended channel mixing for effectively capturing discriminative relation features within an RoI. These discriminative relation features are further enriched by introducing a spatio-channel attention where the foreground and background discriminability is empowered in a joint spatio-channel space. Our ARM module is generic and it does not rely on fine-grained supervision or topological assumptions, hence being easily integrated into any Faster R-CNN based person search methods. Comprehensive experiments are performed on two challenging benchmark datasets: CUHKSYSU and PRW. Our PS-ARM achieves state-of-the-art performance on both datasets. On the challenging PRW dataset, our PS-ARM achieves an absolute gain of 5 in the mAP score over SeqNet, while operating at a comparable speed.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89310453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition
Pub Date: 2022-10-06, DOI: 10.48550/arXiv.2210.02693
Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qi-dong Liu, Pichao Wang, Mingliang Xu, Wanqing Li
Despite the great progress achieved by transformers in various vision tasks, they remain underexplored for skeleton-based action recognition, with only a few attempts. Moreover, these methods directly compute pair-wise global self-attention equally for all joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and short-range temporal dynamics. In this work, we propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer) equipped with two key components: (1) FG-SFormer: a focal-joint and global-part coupling spatial transformer. It forces the network to focus on modelling correlations for the learned discriminative spatial joints and human body parts respectively. The selected focal joints eliminate the negative effect of non-informative joints when accumulating correlations. Meanwhile, interactions between the focal joints and body parts are incorporated to enhance spatial dependencies via mutual cross-attention. (2) FG-TFormer: a focal and global temporal transformer. Dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints or body parts, which we find vitally important for making the temporal transformer work. Extensive experimental results on three benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show that our FG-STFormer surpasses all existing transformer-based methods and compares favourably with state-of-the-art GCN-based methods.
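One way to picture the FG-TFormer idea of injecting dilated temporal convolution into global temporal attention is the sketch below; the head count, dilation and additive fusion are assumptions for illustration, not the paper's block.

```python
import torch
import torch.nn as nn

class FocalGlobalTemporal(nn.Module):
    """Couples global temporal self-attention with a dilated temporal
    convolution branch that captures short-range motion patterns."""
    def __init__(self, dim, heads=4, dilation=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local = nn.Conv1d(dim, dim, kernel_size=3, dilation=dilation,
                               padding=dilation)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                                  # x: (B, T, C) per joint/part
        g, _ = self.attn(x, x, x)                          # global temporal relations
        l = self.local(x.transpose(1, 2)).transpose(1, 2)  # local dilated motion
        return self.norm(x + g + l)

x = torch.rand(2, 64, 128)                                 # batch, frames, feature dim
out = FocalGlobalTemporal(128)(x)
```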
{"title":"Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition","authors":"Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qi-dong Liu, Pichao Wang, Mingliang Xu, Wanqing Li","doi":"10.48550/arXiv.2210.02693","DOIUrl":"https://doi.org/10.48550/arXiv.2210.02693","url":null,"abstract":"Despite great progress achieved by transformer in various vision tasks, it is still underexplored for skeleton-based action recognition with only a few attempts. Besides, these methods directly calculate the pair-wise global self-attention equally for all the joints in both the spatial and temporal dimensions, undervaluing the effect of discriminative local joints and the short-range temporal dynamics. In this work, we propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer), that is equipped with two key components: (1) FG-SFormer: focal joints and global parts coupling spatial transformer. It forces the network to focus on modelling correlations for both the learned discriminative spatial joints and human body parts respectively. The selective focal joints eliminate the negative effect of non-informative ones during accumulating the correlations. Meanwhile, the interactions between the focal joints and body parts are incorporated to enhance the spatial dependencies via mutual cross-attention. (2) FG-TFormer: focal and global temporal transformer. Dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints or body parts, which is found to be vital important to make temporal transformer work. Extensive experimental results on three benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show our FG-STFormer surpasses all existing transformer-based methods, and compares favourably with state-of-the art GCN-based methods.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88447890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Compressed Vision for Efficient Video Understanding
Pub Date: 2022-10-06, DOI: 10.48550/arXiv.2210.02995
Olivia Wiles, J. Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski
Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches even just to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals, however, has the downside of precluding standard augmentation techniques if done naively. We address this by introducing a small network that applies transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using a compressed representation.
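The latent-space augmentation idea can be sketched as a small residual network acting directly on codec latents before they are fed to a video model; the latent shapes, layer sizes and the `video_model` placeholder below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentAugment(nn.Module):
    """Small network that transforms latent codes to mimic a pixel-space
    augmentation (e.g. a crop or flip) without decoding the video."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(dim, dim, 3, padding=1))

    def forward(self, z):                      # z: (B*T, C, h, w) latent frames
        return z + self.net(z)

# Pipeline sketch: neural-codec latents -> latent augmentation -> video model.
latents = torch.rand(4 * 16, 256, 16, 16)      # 4 clips x 16 compressed frames
augmented = LatentAugment()(latents)
# `video_model(augmented)` would stand in for any network accepting latent frames.
```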
{"title":"Compressed Vision for Efficient Video Understanding","authors":"Olivia Wiles, J. Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski","doi":"10.48550/arXiv.2210.02995","DOIUrl":"https://doi.org/10.48550/arXiv.2210.02995","url":null,"abstract":"Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos require more scalable approaches even to process them. In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks. Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals has, however, the downside of precluding standard augmentation techniques if done naively. We address that by introducing a small network that can apply transformations to latent codes corresponding to commonly used augmentations in the original video space. We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representation.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87621842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Structure Representation Network and Uncertainty Feedback Learning for Dense Non-Uniform Fog Removal
Pub Date: 2022-10-06, DOI: 10.48550/arXiv.2210.03061
Yeying Jin, Wending Yan, Wenhan Yang, R. Tan
Few existing image defogging or dehazing methods consider dense and non-uniform particle distributions, which usually occur in smoke, dust and fog. Dealing with these dense and/or non-uniform distributions can be intractable, since fog's attenuation and airlight (or veiling effect) significantly weaken the background scene information in the input image. To address this problem, we introduce a structure-representation network with uncertainty feedback learning. Specifically, we extract feature representations from a pre-trained Vision Transformer (DINO-ViT) module to recover the background information. To guide our network to focus on non-uniform fog areas and then remove the fog accordingly, we introduce uncertainty feedback learning, which produces uncertainty maps that have higher uncertainty in denser fog regions and can be regarded as attention maps representing the fog's density and uneven distribution. Based on the uncertainty map, our feedback network refines the defogged output iteratively. Moreover, to handle the intractability of estimating the atmospheric light colors, we exploit the grayscale version of the input image, since it is less affected by the varying light colors that may be present in the input image. The experimental results demonstrate the effectiveness of our method, both quantitatively and qualitatively, compared to state-of-the-art methods in handling dense and non-uniform fog or smoke.
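The uncertainty feedback loop can be pictured as an iterative module that predicts a defogged estimate together with a per-pixel uncertainty map and feeds the map back into the next pass; the toy convolutional backbone and residual update below are assumptions, not the paper's DINO-ViT based network.

```python
import torch
import torch.nn as nn

class UncertaintyFeedback(nn.Module):
    """Each iteration takes the current estimate plus the previous uncertainty
    map and predicts a 3-channel correction and a 1-channel uncertainty map,
    which guides the next refinement pass."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3 + 1, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, 4, 3, padding=1))

    def forward(self, foggy, iters=3):
        b, _, h, w = foggy.shape
        uncertainty = torch.zeros(b, 1, h, w, device=foggy.device)
        out = foggy
        for _ in range(iters):
            pred = self.body(torch.cat([out, uncertainty], dim=1))
            out = foggy - pred[:, :3]                      # refined defogged estimate
            uncertainty = torch.sigmoid(pred[:, 3:4])      # higher in denser fog
        return out, uncertainty

defogged, unc = UncertaintyFeedback()(torch.rand(1, 3, 128, 128))
```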
{"title":"Structure Representation Network and Uncertainty Feedback Learning for Dense Non-Uniform Fog Removal","authors":"Yeying Jin, Wending Yan, Wenhan Yang, R. Tan","doi":"10.48550/arXiv.2210.03061","DOIUrl":"https://doi.org/10.48550/arXiv.2210.03061","url":null,"abstract":"Few existing image defogging or dehazing methods consider dense and non-uniform particle distributions, which usually happen in smoke, dust and fog. Dealing with these dense and/or non-uniform distributions can be intractable, since fog's attenuation and airlight (or veiling effect) significantly weaken the background scene information in the input image. To address this problem, we introduce a structure-representation network with uncertainty feedback learning. Specifically, we extract the feature representations from a pre-trained Vision Transformer (DINO-ViT) module to recover the background information. To guide our network to focus on non-uniform fog areas, and then remove the fog accordingly, we introduce the uncertainty feedback learning, which produces the uncertainty maps, that have higher uncertainty in denser fog regions, and can be regarded as an attention map that represents fog's density and uneven distribution. Based on the uncertainty map, our feedback network refines our defogged output iteratively. Moreover, to handle the intractability of estimating the atmospheric light colors, we exploit the grayscale version of our input image, since it is less affected by varying light colors that are possibly present in the input image. The experimental results demonstrate the effectiveness of our method both quantitatively and qualitatively compared to the state-of-the-art methods in handling dense and non-uniform fog or smoke.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91022873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}