Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision: latest papers, page 4

Learning Video-independent Eye Contact Segmentation from In-the-Wild Videos

Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision

Pub Date : 2022-10-05 DOI: 10.48550/arXiv.2210.02033

Tianyi Wu, Yusuke Sugano

Human eye contact is a form of non-verbal communication and can have a great influence on social behavior. Since the location and size of the eye contact targets vary across different videos, learning a generic video-independent eye contact detector is still a challenging task. In this work, we address the task of one-way eye contact detection for videos in the wild. Our goal is to build a unified model that can identify when a person is looking at their gaze targets in an arbitrary input video. Considering that this requires time-series relative eye movement information, we propose to formulate the task as temporal segmentation. Due to the scarcity of labeled training data, we further propose a gaze target discovery method to generate pseudo-labels for unlabeled videos, which allows us to train a generic eye contact segmentation model in an unsupervised way using in-the-wild videos. To evaluate our proposed approach, we manually annotated a test dataset consisting of 52 videos of human conversations. Experimental results show that our eye contact segmentation model outperforms the previous video-dependent eye contact detector and can achieve 71.88% framewise accuracy on our annotated test set. Our code and evaluation dataset are available at https://github.com/ut-vision/Video-Independent-ECS.
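
The abstract reports framewise accuracy on a temporal segmentation task. As a minimal sketch of what such a metric measures, assuming binary per-frame labels (1 = eye contact, 0 = no eye contact), the snippet below compares two label sequences and collapses frames into segments. The function names and the toy labels are illustrative and are not taken from the authors' repository.

```python
from itertools import groupby


def framewise_accuracy(pred, truth):
    """Fraction of frames whose predicted label equals the annotated label."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def to_segments(labels):
    """Collapse per-frame labels into (label, start, end_exclusive) segments."""
    segments, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical 8-frame clip: 1 = eye contact, 0 = no eye contact.
truth = [0, 0, 1, 1, 1, 0, 0, 1]
pred = [0, 1, 1, 1, 0, 0, 0, 1]

print(framewise_accuracy(pred, truth))  # 0.75 (6 of 8 frames agree)
print(to_segments(pred))  # [(0, 0, 1), (1, 1, 4), (0, 4, 7), (1, 7, 8)]
```

Working on segments rather than isolated frames is what distinguishes a segmentation view from independent per-frame classification.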

Citations: 0

Two Video Data Sets for Tracking and Retrieval of Out of Distribution Objects

Pub Date : 2022-10-05 DOI: 10.48550/arXiv.2210.02074

Kira Maag, Robin Chan, Svenja Uhlemeyer, K. Kowol, H. Gottschalk

In this work we present two video test data sets for the novel computer vision (CV) task of out-of-distribution tracking (OOD tracking). Here, OOD objects are understood as objects with a semantic class outside the semantic space of an underlying image segmentation algorithm, or as instances within the semantic space that nevertheless look decisively different from the instances contained in the training data. OOD objects occurring in video sequences should be detected on single frames as early as possible and tracked over their time of appearance for as long as possible. During the time of appearance, they should be segmented as precisely as possible. We present the SOS data set containing 20 video sequences of street scenes and more than 1000 labeled frames with up to two OOD objects. We furthermore publish the synthetic CARLA-WildLife data set, which consists of 26 video sequences containing up to four OOD objects in a single frame. We propose metrics to measure the success of OOD tracking and develop a baseline algorithm that efficiently tracks the OOD objects. As an application that benefits from OOD tracking, we retrieve OOD sequences from unlabeled videos of street scenes containing OOD objects.
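
The abstract does not spell out the baseline tracker or the proposed metrics. Purely as an illustration of how segmented objects can be linked across frames, the sketch below associates masks greedily by intersection over union (IoU). The mask representation (sets of pixel coordinates), the `min_iou` threshold, and the function names are assumptions made for this example, not the authors' algorithm.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0


def link_tracks(prev_tracks, detections, min_iou=0.3):
    """Greedily assign each detection to the best-overlapping previous track.

    prev_tracks: dict track_id -> mask from the previous frame.
    detections:  list of masks found in the current frame.
    Returns dict track_id -> mask for the current frame; detections without a
    sufficiently overlapping track start new tracks with fresh ids.
    """
    next_id = max(prev_tracks, default=-1) + 1
    current, used = {}, set()
    for det in detections:
        candidates = [(iou(mask, det), tid)
                      for tid, mask in prev_tracks.items() if tid not in used]
        best_iou, best_id = max(candidates, default=(0.0, None))
        if best_id is not None and best_iou >= min_iou:
            current[best_id] = det
            used.add(best_id)
        else:
            current[next_id] = det
            next_id += 1
    return current


# Toy example: one existing track, two detections in the next frame.
prev = {0: {(0, 0), (0, 1), (1, 0), (1, 1)}}
dets = [{(0, 1), (1, 1), (0, 2), (1, 2)}, {(5, 5)}]
tracks = link_tracks(prev, dets)  # first detection continues track 0
```

Greedy matching is the simplest choice; an optimal assignment (for example, the Hungarian method) would be a natural alternative when many objects overlap.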

Citations: 6

Decanus to Legatus: Synthetic training for 2D-3D human pose lifting

Pub Date : 2022-10-05 DOI: 10.48550/arXiv.2210.02231

Yue Zhu, David Picard

3D human pose estimation is a challenging task because of the difficulty of acquiring ground-truth data outside of controlled environments. A number of further issues have been hindering progress in building a universal and robust model for this task, including domain gaps between different datasets, unseen actions between training and test datasets, varying hardware settings, and the high cost of annotation. In this paper, we propose an algorithm to generate an unlimited number of synthetic 3D human poses (Legatus) from a 3D pose distribution based on 10 initial handcrafted 3D poses (Decanus) during the training of a 2D to 3D human pose lifter neural network. Our results show that we can achieve 3D pose estimation performance comparable to methods using real data from specialized datasets, but in a zero-shot setup, showing the generalization potential of our framework.
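
A 2D-to-3D lifter is trained on pairs of 2D inputs and 3D targets, so synthetic 3D poses need a projection step to produce the 2D side. The sketch below shows one way such a pair could be generated from a seed pose: a random rotation about the vertical axis followed by a pinhole projection. The three-joint seed pose, the camera parameters, and the sampling scheme are assumptions for illustration; the abstract does not describe the authors' pose distribution in this detail.

```python
import math
import random


def rotate_y(pose, angle):
    """Rotate a list of (x, y, z) joints about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in pose]


def project(pose, focal=1.0, depth=5.0):
    """Pinhole projection after placing the pose `depth` units in front of the camera."""
    return [(focal * x / (z + depth), focal * y / (z + depth)) for x, y, z in pose]


def sample_pair(seed_pose, rng):
    """Return one (2D input, 3D target) training pair derived from a seed pose."""
    pose_3d = rotate_y(seed_pose, rng.uniform(0.0, 2.0 * math.pi))
    return project(pose_3d), pose_3d


# Hypothetical 3-joint seed pose (head, hip, foot); units are arbitrary.
seed = [(0.0, 1.7, 0.0), (0.1, 1.0, 0.0), (0.2, 0.0, 0.1)]
pose_2d, pose_3d = sample_pair(seed, random.Random(0))
```

Because new pairs are drawn on the fly, the training set is never exhausted, which is the sense in which a generator of this kind supplies unlimited poses.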

Citations: 0

Multi-stream Fusion for Class Incremental Learning in Pill Image Classification

Pub Date : 2022-10-05 DOI: 10.48550/arXiv.2210.02313

Trong T. Nguyen, Hieu Pham, Phi-Le Nguyen, T. Nguyen, Minh N. Do

Classifying pill categories from real-world images is crucial for various smart healthcare applications. Although existing approaches in image classification might achieve good performance on fixed pill categories, they fail to handle novel instances of pill categories that are frequently presented to the learning algorithm. To this end, a trivial solution is to train the model with novel classes. However, this may result in a phenomenon known as catastrophic forgetting, in which the system forgets what it learned in previous classes. In this paper, we address this challenge by introducing the class incremental learning (CIL) ability into traditional pill image classification systems. Specifically, we propose a novel incremental multi-stream intermediate fusion framework enabling incorporation of an additional guidance information stream that best matches the domain of the problem into various state-of-the-art CIL methods. From this framework, we consider color-specific information of pill images as a guidance stream and devise an approach, namely "Color Guidance with Multi-stream intermediate fusion" (CG-IMIF), for solving the CIL pill image classification task. We conduct comprehensive experiments on a real-world incremental pill image classification dataset, namely VAIPE-PCIL, and find that CG-IMIF consistently outperforms several state-of-the-art methods by a large margin in different task settings. Our code, data, and trained model are available at https://github.com/vinuni-vishc/CG-IMIF.
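
To make the idea of a color guidance stream concrete, the sketch below computes a coarse RGB histogram from an image's pixels and fuses it with an existing feature vector by concatenation. This is only an assumed, simplified stand-in: the abstract does not state how CG-IMIF encodes color or performs the fusion, and the function names and bin count are hypothetical.

```python
def color_histogram(pixels, bins=4):
    """Normalized joint RGB histogram of 8-bit pixels, flattened to bins**3 values."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # v * bins // 256 maps 0..255 onto bin indices 0..bins-1.
        idx = ((r * bins // 256) * bins + (g * bins // 256)) * bins + (b * bins // 256)
        hist[idx] += 1.0
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def fuse(image_features, color_features):
    """Fuse two feature streams by simple concatenation (an assumed choice)."""
    return list(image_features) + list(color_features)


# Toy "image" of four pixels: two reddish, one blue, one black.
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255), (0, 0, 0)]
color_stream = color_histogram(pixels, bins=2)
fused = fuse([0.3, -1.2], color_stream)  # 2 image features + 8 color features
```

A learned fusion layer would normally follow the concatenation; it is omitted here to keep the sketch free of framework dependencies.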

Citations: 0

APAUNet: Axis Projection Attention UNet for Small Target in 3D Medical Segmentation

Pub Date : 2022-10-04 DOI: 10.48550/arXiv.2210.01485

Yuncheng Jiang, Zixun Zhang, Shixi Qin, Yao Guo, Zhuguo Li, Shuguang Cui

In 3D medical image segmentation, small-target segmentation is crucial for diagnosis but still faces challenges. In this paper, we propose the Axis Projection Attention UNet, named APAUNet, for 3D medical image segmentation, especially for small targets. Considering the large proportion of the background in the 3D feature space, we introduce a projection strategy to project the 3D features into three orthogonal 2D planes to capture the contextual attention from different views. In this way, we can filter out the redundant feature information and mitigate the loss of critical information for small lesions in 3D scans. Then we utilize a dimension hybridization strategy to fuse the 3D features with attention from different axes and merge them by a weighted summation to adaptively learn the importance of different perspectives. Finally, in the APA Decoder, we concatenate both high- and low-resolution features in the 2D projection process, thereby obtaining more precise multi-scale information, which is vital for small lesion segmentation. Quantitative and qualitative experimental results on two public datasets (BTCV and MSD) demonstrate that our proposed APAUNet outperforms the other methods. Concretely, our APAUNet achieves an average Dice score of 87.84 on BTCV, 84.48 on MSD-Liver and 69.13 on MSD-Pancreas, and significantly surpasses the previous SOTA methods on small targets.
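
Two ingredients of this abstract are easy to illustrate: projecting a 3D volume onto three orthogonal 2D planes, and the Dice score used for evaluation. The sketch below uses plain averaging along each axis as the projection, which is an assumption for illustration; APAUNet's projection is learned and attention-based. The Dice function returns a value in [0, 1], whereas the abstract's figures appear to be on a 0 to 100 scale.

```python
def project_volume(vol):
    """Average a D x H x W volume (nested lists) along each of its three axes.

    Returns three 2D maps of shapes H x W, D x W and D x H.
    """
    d, h, w = len(vol), len(vol[0]), len(vol[0][0])
    along_d = [[sum(vol[k][i][j] for k in range(d)) / d for j in range(w)]
               for i in range(h)]
    along_h = [[sum(vol[k][i][j] for i in range(h)) / h for j in range(w)]
               for k in range(d)]
    along_w = [[sum(vol[k][i][j] for j in range(w)) / w for i in range(h)]
               for k in range(d)]
    return along_d, along_h, along_w


def dice(pred, truth):
    """Dice score 2|A & B| / (|A| + |B|) for two sets of voxel coordinates."""
    denom = len(pred) + len(truth)
    return 2 * len(pred & truth) / denom if denom else 1.0


# Toy 2 x 2 x 2 binary volume with three foreground voxels.
vol = [[[1, 0], [0, 0]],
       [[1, 0], [0, 1]]]
plane_hw, plane_dw, plane_dh = project_volume(vol)

pred_voxels = {(0, 0, 0), (1, 0, 0)}
true_voxels = {(0, 0, 0), (1, 0, 0), (1, 1, 1)}
print(dice(pred_voxels, true_voxels))  # 0.8
```

The example also shows why small targets are hard: missing a single voxel out of three already costs 0.2 Dice.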

Citations: 3

Fully Transformer Network for Change Detection of Remote Sensing Images

Pub Date : 2022-10-03 DOI: 10.48550/arXiv.2210.00757

Tianyu Yan, Zifu Wan, Pingping Zhang

Recently, change detection (CD) of remote sensing images has achieved great progress with the advances of deep learning. However, current methods generally deliver incomplete CD regions and irregular CD boundaries due to the limited representation ability of the extracted visual features. To relieve these issues, in this work we propose a novel learning framework named Fully Transformer Network (FTN) for remote sensing image CD, which improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner. More specifically, the proposed framework first utilizes the advantages of Transformers in long-range dependency modeling. It can help to learn more discriminative global-level features and obtain complete CD regions. Then, we introduce a pyramid structure to aggregate multi-level visual features from Transformers for feature enhancement. The pyramid structure grafted with a Progressive Attention Module (PAM) can improve the feature representation ability with additional interdependencies through channel attention. Finally, to better train the framework, we utilize deeply supervised learning with multiple boundary-aware loss functions. Extensive experiments demonstrate that our proposed method achieves a new state-of-the-art performance on four public CD benchmarks. For model reproduction, the source code is released at https://github.com/AI-Zhpp/FTN.

Citations: 10

SymmNeRF: Learning to Explore Symmetry Prior for Single-View View Synthesis

Pub Date : 2022-09-29 DOI: 10.48550/arXiv.2209.14819

Xingyi Li, Chaoyi Hong, Yiran Wang, Z. Cao, Ke Xian, Guosheng Lin

We study the problem of novel view synthesis of objects from a single image. Existing methods have demonstrated the potential of single-view view synthesis. However, they still fail to recover fine appearance details, especially in self-occluded areas. This is because a single view only provides limited information. We observe that man-made objects usually exhibit symmetric appearances, which introduce additional prior knowledge. Motivated by this, we investigate the potential performance gains of explicitly embedding symmetry into the scene representation. In this paper, we propose SymmNeRF, a neural radiance field (NeRF) based framework that combines local and global conditioning under the introduction of symmetry priors. In particular, SymmNeRF takes the pixel-aligned image features and the corresponding symmetric features as extra inputs to the NeRF, whose parameters are generated by a hypernetwork. As the parameters are conditioned on the image-encoded latent codes, SymmNeRF is thus scene-independent and can generalize to new scenes. Experiments on synthetic and real-world datasets show that SymmNeRF synthesizes novel views with more details regardless of the pose transformation, and demonstrates good generalization when applied to unseen objects. Code is available at: https://github.com/xingyi-li/SymmNeRF.
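
Using "corresponding symmetric features" presupposes a way to find, for any 3D point, its mirror image under the object's symmetry. The sketch below shows the underlying geometry: reflecting a point across a plane. The choice of the plane x = 0 as the default symmetry plane and the function name are assumptions for illustration; the abstract does not state how SymmNeRF parameterizes the symmetry.

```python
def reflect(point, normal=(1.0, 0.0, 0.0), offset=0.0):
    """Mirror a 3D point across the plane {p : normal . p = offset}.

    `normal` is assumed to have unit length.
    """
    dist = sum(n * p for n, p in zip(normal, point)) - offset
    return tuple(p - 2.0 * dist * n for p, n in zip(point, normal))


# A query point and its mirror image across the assumed symmetry plane x = 0.
query = (0.3, 0.5, -1.0)
mirrored = reflect(query)  # x changes sign; y and z are unchanged
```

Features sampled at the projection of `mirrored` can then supplement those sampled at `query`, which is useful when the query point is self-occluded in the input view.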

Citations: 10

Thinking Hallucination for Video Captioning

Pub Date : 2022-09-28 DOI: 10.48550/arXiv.2209.13853

Nasib Ullah, Partha Pratim Mohanta

With the advent of rich visual representations and pre-trained language models, video captioning has seen continuous improvement over time. Despite the performance improvement, video captioning models are prone to hallucination. Hallucination refers to the generation of highly pathological descriptions that are detached from the source material. In video captioning, there are two kinds of hallucination: object and action hallucination. Instead of endeavoring to learn better representations of a video, in this work, we investigate the fundamental sources of the hallucination problem. We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy. To alleviate these problems, we propose two robust solutions: (a) the introduction of auxiliary heads trained in multi-label settings on top of the extracted visual features and (b) the addition of context gates, which dynamically select the features during fusion. The standard evaluation metrics for video captioning measure similarity with ground-truth captions and do not adequately capture object and action relevance. To this end, we propose a new metric, COAHA (caption object and action hallucination assessment), which assesses the degree of hallucination. Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets, with an especially large margin in CIDEr score.
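
The abstract does not give the COAHA formula. To illustrate the general idea of scoring object hallucination, the sketch below computes the share of object words in a generated caption that are absent from the reference captions. This simple ratio, the whitespace tokenization, and the toy vocabulary are assumptions for this example and should not be read as the authors' metric.

```python
def hallucination_rate(caption_tokens, reference_tokens, vocabulary):
    """Share of vocabulary words in the caption that never occur in the references."""
    mentioned = [t for t in caption_tokens if t in vocabulary]
    if not mentioned:
        return 0.0
    missing = [t for t in mentioned if t not in reference_tokens]
    return len(missing) / len(mentioned)


# Hypothetical object vocabulary and captions.
objects = {"man", "woman", "horse", "bike"}
caption = "a man is riding a horse".split()
references = set("a man is riding a bike".split())

print(hallucination_rate(caption, references, objects))  # 0.5 ("horse" is unsupported)
```

The same function can be applied to an action vocabulary (verbs) to obtain a rough analogue for action hallucination.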

Citations: 1

Weighted Contrastive Hashing

Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision

Pub Date : 2022-09-28 DOI: 10.48550/arXiv.2209.14099

Jiaguo Yu, Huming Qiu, Dubing Chen, Haofeng Zhang

Unsupervised hashing has been advanced by the recently popular contrastive learning paradigm. However, previous contrastive learning-based works have been hampered by (1) insufficient data similarity mining based on global-only image representations, and (2) the hash code semantic loss caused by data augmentation. In this paper, we propose a novel method, Weighted Contrastive Hashing (WCH), as a step towards solving these two problems. We introduce a novel mutual attention module to alleviate the information asymmetry in network features caused by missing image structure during contrastive augmentation. Furthermore, we explore the fine-grained semantic relations between images: we divide the images into multiple patches and calculate similarities between patches. The aggregated weighted similarities, which reflect the deep image relations, are distilled with a distillation loss to facilitate hash code learning and thereby improve retrieval performance. Extensive experiments show that the proposed WCH significantly outperforms existing unsupervised hashing methods on three benchmark datasets.
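The patch-level similarity idea in this abstract can be sketched roughly as follows. This is a minimal illustration under assumptions, not the WCH method itself: the function names, the `temperature` parameter, and the choice of a softmax over pairwise cosine similarities as the weighting are hypothetical, and the mutual attention module and distillation loss are not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def weighted_patch_similarity(patches_a, patches_b, temperature=0.1):
    """Aggregate patch-to-patch similarities into one image-level score.

    Assumption: the weights are a softmax over the pairwise similarities,
    so well-matched patch pairs dominate. The paper's weighting may differ.
    """
    sims = [cosine(p, q) for p in patches_a for q in patches_b]
    max_sim = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - max_sim) / temperature) for s in sims]
    total = sum(exps)
    return sum(s * e / total for s, e in zip(sims, exps))
```

Because the result is a convex combination of cosine similarities, it always lies in [-1, 1]; a lower `temperature` concentrates the weight on the best-matching patch pairs.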



Citations: 1

Neural Network Panning: Screening the Optimal Sparse Network Before Training

Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision

Pub Date : 2022-09-27 DOI: 10.48550/arXiv.2209.13378

Xiatao Kang, P. Li, Jiayi Yao, Chengxi Li

Pruning neural networks before training not only compresses the original models but also accelerates the network training phase, which has substantial application value. Current work focuses on fine-grained pruning, which uses metrics to calculate weight scores for weight screening, and has extended from the initial single-order pruning to iterative pruning. Drawing on these works, we argue that network pruning can be summarized as an expressive force transfer process of weights, in which the retained weights take on the expressive force of the removed ones in order to maintain the performance of the original network. To achieve optimal expressive force scheduling, we propose a pruning-before-training scheme called Neural Network Panning, which guides expressive force transfer through multi-index and multi-process steps, and we design a panning agent based on reinforcement learning to automate the process. Experimental results show that Panning performs better than various available pruning-before-training methods.
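The general pattern of score-based iterative pruning that this abstract builds on can be sketched as below. This is an illustration under assumptions, not Panning itself: weight magnitude stands in for the paper's multiple metrics, the geometric density schedule is one common choice rather than the paper's, and the reinforcement-learning agent is not modelled. The names `prune_mask`, `iterative_prune`, and `magnitude` are hypothetical.

```python
def prune_mask(scores, sparsity):
    """Return a 0/1 mask keeping the highest-scoring (1 - sparsity) fraction."""
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(scores))]


def iterative_prune(weights, score_fn, final_sparsity, rounds):
    """Prune over several rounds, rescoring the surviving weights each time."""
    mask = [1] * len(weights)
    for r in range(1, rounds + 1):
        # Geometric schedule: density shrinks step by step toward the target.
        density = (1.0 - final_sparsity) ** (r / rounds)
        scores = score_fn([w * m for w, m in zip(weights, mask)])
        # Weights pruned in an earlier round must never be revived.
        scores = [s if m else float("-inf") for s, m in zip(scores, mask)]
        mask = prune_mask(scores, 1.0 - density)
    return mask


def magnitude(ws):
    """Stand-in scoring metric (assumption): absolute weight value."""
    return [abs(w) for w in ws]
```

Calling `iterative_prune(weights, magnitude, 0.75, 3)` on eight weights keeps five, then three, then two of them, so the final mask retains 25% of the weights. Swapping `magnitude` for another scoring function changes which weights survive without changing the schedule.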


Citations: 0