Reinforced Pedestrian Attribute Recognition with Group Optimization Reward
Pub Date: 2022-05-21  DOI: 10.48550/arXiv.2205.14042
Zhong Ji, Zhenfei Hu, Yaodong Wang, Shengjia Li
Pedestrian Attribute Recognition (PAR) is a challenging task in intelligent video surveillance. Two key challenges in PAR are the complex alignment relations between images and attributes and the imbalanced data distribution. Existing approaches usually formulate PAR as a recognition task. In contrast, this paper addresses it as a decision-making task via a reinforcement learning framework. Specifically, PAR is formulated as a Markov decision process (MDP) by carefully designing the states, action space, reward function, and state transitions. To alleviate the inter-attribute imbalance problem, we apply an Attribute Grouping Strategy (AGS) that divides all attributes into subgroups according to their region and category information. We then employ an agent, trained with the Deep Q-learning algorithm, to recognize each group of attributes. We also propose a Group Optimization Reward (GOR) function to alleviate the intra-attribute imbalance problem. Experimental results on three benchmark datasets, PETA, RAP, and PA100K, demonstrate the effectiveness and competitiveness of the proposed approach and show that applying reinforcement learning to PAR is a promising research direction.
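To make the formulation above concrete, the following is a minimal sketch, not the authors' implementation: an agent conditioned on an image feature and a group index picks a joint binary prediction for that group's attributes, and a frequency-weighted reward favours correct decisions on rare attribute values. The group names, network size, and exact reward weighting are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of framing grouped attribute
# recognition as an MDP solved with Q-learning. Group names, the reward
# weighting, and the tiny Q-network below are illustrative assumptions.
import torch
import torch.nn as nn

ATTRIBUTE_GROUPS = {            # hypothetical region/category grouping
    "head":  ["hat", "glasses"],
    "upper": ["long_sleeve", "logo"],
    "lower": ["trousers", "skirt"],
}

class GroupQNet(nn.Module):
    """Q-network: state = image feature + one-hot group id,
    action = one binary prediction pattern for that group's attributes."""
    def __init__(self, feat_dim, n_groups, n_actions):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + n_groups, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )
    def forward(self, feat, group_onehot):
        return self.mlp(torch.cat([feat, group_onehot], dim=-1))

def group_reward(pred, target, pos_freq):
    # Reward correct decisions, weighting rare (low-frequency) attribute values
    # more heavily to counter intra-attribute imbalance (an assumption about
    # how a group-level optimization reward could look).
    weight = torch.where(target == 1, 1.0 / pos_freq, 1.0 / (1 - pos_freq))
    correct = (pred == target).float()
    return (weight * (2 * correct - 1)).mean()

# toy usage: one decision step for the "upper" group
feat = torch.randn(1, 512)                               # image feature (state)
group_onehot = torch.tensor([[0.0, 1.0, 0.0]])           # group id for "upper"
n_actions = 2 ** len(ATTRIBUTE_GROUPS["upper"])          # all binary patterns
qnet = GroupQNet(512, 3, n_actions)
action = qnet(feat, group_onehot).argmax(dim=-1)         # greedy action
pred = torch.tensor([[(action.item() >> i) & 1 for i in range(2)]]).float()
target = torch.tensor([[1.0, 0.0]])
pos_freq = torch.tensor([[0.2, 0.4]])                    # attribute label frequencies
print(group_reward(pred, target, pos_freq))
```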
{"title":"Reinforced Pedestrian Attribute Recognition with Group Optimization Reward","authors":"Zhong Ji, Zhenfei Hu, Yaodong Wang, Shengjia Li","doi":"10.48550/arXiv.2205.14042","DOIUrl":"https://doi.org/10.48550/arXiv.2205.14042","url":null,"abstract":"Pedestrian Attribute Recognition (PAR) is a challenging task in intelligent video surveillance. Two key challenges in PAR include complex alignment relations between images and attributes, and imbalanced data distribution. Existing approaches usually formulate PAR as a recognition task. Different from them, this paper addresses it as a decision-making task via a reinforcement learning framework. Specifically, PAR is formulated as a Markov decision process (MDP) by designing ingenious states, action space, reward function and state transition. To alleviate the inter-attribute imbalance problem, we apply an Attribute Grouping Strategy (AGS) by dividing all attributes into subgroups according to their region and category information. Then we employ an agent to recognize each group of attributes, which is trained with Deep Q-learning algorithm. We also propose a Group Optimization Reward (GOR) function to alleviate the intra-attribute imbalance problem. Experimental results on the three benchmark datasets of PETA, RAP and PA100K illustrate the effectiveness and competitiveness of the proposed approach and demonstrate that the application of reinforcement learning to PAR is a valuable research direction.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82958034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast, accurate and robust registration of multiple depth sensors without need for RGB and IR images
Pub Date: 2022-05-17  DOI: 10.1007/s00371-022-02505-2
Andre Mühlenbrock, Roland Fischer, Christoph Schröder‐Dering, René Weller, G. Zachmann
{"title":"Fast, accurate and robust registration of multiple depth sensors without need for RGB and IR images","authors":"Andre Mühlenbrock, Roland Fischer, Christoph Schröder‐Dering, René Weller, G. Zachmann","doi":"10.1007/s00371-022-02505-2","DOIUrl":"https://doi.org/10.1007/s00371-022-02505-2","url":null,"abstract":"","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72984378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Continual learning on 3D point clouds with random compressed rehearsal
Pub Date: 2022-05-16  DOI: 10.48550/arXiv.2205.08013
M. Zamorski, Michal Stypulkowski, Konrad Karanowski, Tomasz Trzciński, Maciej Zięba
Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are an important data type for precisely modeling three-dimensional environments, but processing this type of data effectively remains challenging. In a world of large, heavily parameterized network architectures and continuously streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. In 3D computer vision, however, there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We exploit point cloud structure properties to preserve a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods during learning, our approach achieves a significant decrease in catastrophic forgetting compared to existing solutions on several of the most popular point cloud datasets, under two continual learning settings: one in which the task is known beforehand, and the more challenging scenario in which task information is unknown to the model.
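As an illustration of the rehearsal idea described above, here is a minimal sketch under the assumption that past point clouds are compressed by random point subsampling and replayed alongside new-task batches; the buffer size, subset size, and eviction policy are arbitrary choices, not the paper's.

```python
# Minimal sketch (an assumption, not the paper's code) of rehearsal with
# randomly compressed point clouds: past clouds are stored as small random
# subsets of their points and replayed alongside new-task batches.
import numpy as np

class CompressedRehearsalBuffer:
    def __init__(self, keep_points=128, capacity=200):
        self.keep_points = keep_points
        self.capacity = capacity
        self.clouds, self.labels = [], []

    def add(self, cloud, label):
        # cloud: (N, 3) array; keep a random subset of points as the
        # compressed memory of this sample.
        idx = np.random.choice(len(cloud), self.keep_points, replace=False)
        if len(self.clouds) >= self.capacity:            # overwrite a random slot when full
            slot = np.random.randint(self.capacity)
            self.clouds[slot], self.labels[slot] = cloud[idx], label
        else:
            self.clouds.append(cloud[idx]); self.labels.append(label)

    def sample(self, batch_size):
        idx = np.random.randint(len(self.clouds), size=batch_size)
        return [self.clouds[i] for i in idx], [self.labels[i] for i in idx]

# toy usage: store compressed versions of old-task clouds, then replay them
buffer = CompressedRehearsalBuffer()
for _ in range(10):
    buffer.add(np.random.randn(2048, 3), label=0)        # old-task sample
replay_clouds, replay_labels = buffer.sample(4)           # mix into a new-task batch
print(replay_clouds[0].shape, replay_labels[:2])
```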
{"title":"Continual learning on 3D point clouds with random compressed rehearsal","authors":"M. Zamorski, Michal Stypulkowski, Konrad Karanowski, Tomasz Trzci'nski, Maciej Ziȩba","doi":"10.48550/arXiv.2205.08013","DOIUrl":"https://doi.org/10.48550/arXiv.2205.08013","url":null,"abstract":"Contemporary deep neural networks offer state-of-the-art results when applied to visual reasoning, e.g., in the context of 3D point cloud data. Point clouds are important datatype for precise modeling of three-dimensional environments, but effective processing of this type of data proves to be challenging. In the world of large, heavily-parameterized network architectures and continuously-streamed data, there is an increasing need for machine learning models that can be trained on additional data. Unfortunately, currently available models cannot fully leverage training on additional data without losing their past knowledge. Combating this phenomenon, called catastrophic forgetting, is one of the main objectives of continual learning. Continual learning for deep neural networks has been an active field of research, primarily in 2D computer vision, natural language processing, reinforcement learning, and robotics. However, in 3D computer vision, there are hardly any continual learning solutions specifically designed to take advantage of point cloud structure. This work proposes a novel neural network architecture capable of continual learning on 3D point cloud data. We utilize point cloud structure properties for preserving a heavily compressed set of past data. By using rehearsal and reconstruction as regularization methods of the learning process, our approach achieves a significant decrease of catastrophic forgetting compared to the existing solutions on several most popular point cloud datasets considering two continual learning settings: when a task is known beforehand, and in the challenging scenario of when task information is unknown to the model.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76712660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection
Pub Date: 2022-05-05  DOI: 10.48550/arXiv.2205.02717
Mingdong Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Limin Wang
Temporal action detection (TAD) is extensively studied in the video understanding community, generally following the object detection pipeline used for images. However, complex designs are common in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet essential baseline, given the current state of complex designs and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate existing techniques for each component of this baseline and, more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of the design. As a result, the simple BasicTAD yields an astounding, real-time, RGB-only baseline that comes very close to state-of-the-art methods with two-stream inputs. In addition, we further improve BasicTAD by preserving more temporal and spatial information in the network representation (termed PlusTAD). Empirical results demonstrate that PlusTAD is very efficient and significantly outperforms previous methods on the THUMOS14 and FineAction datasets. We also perform in-depth visualization and error analysis of the proposed method to provide further insights into the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.
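The pipeline decomposition can be pictured with a toy end-to-end model; the sketch below is an illustrative assumption, not the BasicTAD code: sampled RGB frames pass through a backbone, a temporal neck, and per-timestep classification and boundary-regression heads.

```python
# Minimal sketch (an illustrative assumption, not the BasicTAD code) of the
# decomposition described above: frame sampling -> backbone -> temporal neck
# -> detection head, trained end to end on RGB frames only.
import torch
import torch.nn as nn

class TinyTADPipeline(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.backbone = nn.Conv3d(3, 64, kernel_size=(1, 7, 7),
                                  stride=(1, 4, 4), padding=(0, 3, 3))
        self.neck = nn.Conv1d(64, 128, kernel_size=3, padding=1)    # temporal context
        self.cls_head = nn.Conv1d(128, num_classes, kernel_size=1)  # per-snippet class scores
        self.reg_head = nn.Conv1d(128, 2, kernel_size=1)            # start/end offsets

    def forward(self, clip):                  # clip: (B, 3, T, H, W) sampled RGB frames
        feat = self.backbone(clip).mean(dim=(-1, -2))               # (B, 64, T) after spatial pooling
        feat = torch.relu(self.neck(feat))                          # (B, 128, T)
        return self.cls_head(feat), self.reg_head(feat)             # per-timestep predictions

clip = torch.randn(2, 3, 32, 112, 112)        # data sampling: 32 RGB frames per clip
cls_logits, boundaries = TinyTADPipeline()(clip)
print(cls_logits.shape, boundaries.shape)     # (2, 20, 32), (2, 2, 32)
```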
{"title":"BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection","authors":"Mingdong Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, Limin Wang","doi":"10.48550/arXiv.2205.02717","DOIUrl":"https://doi.org/10.48550/arXiv.2205.02717","url":null,"abstract":"Temporal action detection (TAD) is extensively studied in the video understanding community by generally following the object detection pipeline in images. However, complex designs are not uncommon in TAD, such as two-stream feature extraction, multi-stage training, complex temporal modeling, and global context fusion. In this paper, we do not aim to introduce any novel technique for TAD. Instead, we study a simple, straightforward, yet must-known baseline given the current status of complex design and low detection efficiency in TAD. In our simple baseline (termed BasicTAD), we decompose the TAD pipeline into several essential components: data sampling, backbone design, neck construction, and detection head. We extensively investigate the existing techniques in each component for this baseline, and more importantly, perform end-to-end training over the entire pipeline thanks to the simplicity of design. As a result, this simple BasicTAD yields an astounding and real-time RGB-Only baseline very close to the state-of-the-art methods with two-stream inputs. In addition, we further improve the BasicTAD by preserving more temporal and spatial information in network representation (termed as PlusTAD). Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction. Meanwhile, we also perform in-depth visualization and error analysis on our proposed method and try to provide more insights on the TAD problem. Our approach can serve as a strong baseline for future TAD research. The code and model will be released at https://github.com/MCG-NJU/BasicTAD.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74452018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantically Accurate Super-Resolution Generative Adversarial Networks
Pub Date: 2022-05-01  DOI: 10.48550/arXiv.2205.08659
Tristan Frizza, D. Dansereau, Nagita Mehr Seresht, M. Bewley
This work addresses the problems of semantic segmentation and image super-resolution by jointly considering the performance of both when training a Generative Adversarial Network (GAN). We propose a novel architecture and domain-specific feature loss, allowing super-resolution to operate as a pre-processing step that increases the performance of downstream computer vision tasks, specifically semantic segmentation. We demonstrate this approach using Nearmap’s aerial imagery dataset, which covers hundreds of urban areas at 5-7 cm per pixel resolution. We show that the proposed approach improves perceived image quality as well as quantitative segmentation accuracy across all prediction classes, yielding average accuracy improvements of 11.8% and 108% at 4× and 32× super-resolution, respectively, compared with state-of-the-art single-network methods. This work demonstrates that jointly considering image-based and task-specific losses can improve the performance of both, and advances the state of the art in semantic-aware super-resolution of aerial imagery.
[Table 1 caption: A comparison of three potential generator model architectures for 4× super-resolution; RRDN was chosen for all subsequent experiments due to its superior overall performance on pixel-wise loss.]
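The joint-training recipe can be sketched as follows; this is an assumption about the general pattern, not the paper's architecture or loss, and the adversarial term of the GAN is omitted for brevity: the super-resolved output is fed to a segmentation network, and both the image-based loss and the task-specific segmentation loss backpropagate into the generator.

```python
# Minimal sketch (an assumption about the general recipe, not the paper's
# implementation) of coupling a super-resolution generator with a downstream
# segmentation loss so both objectives shape the generator's weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

upscale = 4
generator = nn.Sequential(                    # stand-in SR generator
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3 * upscale ** 2, 3, padding=1), nn.PixelShuffle(upscale),
)
segmenter = nn.Conv2d(3, 5, 1)                # stand-in segmentation network (5 classes)

def joint_loss(lr, hr, seg_target, w_seg=0.1):
    sr = generator(lr)
    pixel = F.l1_loss(sr, hr)                                     # image-based loss
    seg = F.cross_entropy(segmenter(sr), seg_target)              # task-specific loss
    return pixel + w_seg * seg

lr = torch.randn(2, 3, 32, 32)
hr = torch.randn(2, 3, 128, 128)
seg_target = torch.randint(0, 5, (2, 128, 128))
loss = joint_loss(lr, hr, seg_target)
loss.backward()                                # gradients reach the generator from both losses
print(float(loss))
```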
{"title":"Semantically Accurate Super-Resolution Generative Adversarial Networks","authors":"Tristan Frizza, D. Dansereau, Nagita Mehr Seresht, M. Bewley","doi":"10.48550/arXiv.2205.08659","DOIUrl":"https://doi.org/10.48550/arXiv.2205.08659","url":null,"abstract":"This work addresses the problems of semantic segmentation and image super-resolution by jointly considering the performance of both in training a Generative Adversarial Network (GAN). We propose a novel architecture and domain-specific feature loss, allowing super-resolution to operate as a pre-processing step to increase the performance of downstream computer vision tasks, specifically semantic segmentation. We demonstrate this approach using Nearmap’s aerial imagery dataset which covers hundreds of urban areas at 5-7 cm per pixel resolution. We show the proposed approach improves perceived image quality as well as quantitative segmentation accuracy across all prediction classes, yielding an average accuracy improvement of 11.8% and 108% at 4 × and 32 × super-resolution, compared with state-of-the art single-network methods. This work demonstrates that jointly considering image-based and task-specific losses can improve the performance of both, and advances the state-of-the-art in semantic-aware super-resolution of aerial imagery. 1: A comparison of of three potential generator model architec- tures for 4 × super-resolution. We chose RRDN for all subsequent ex-periments due to its superior overall performance on pixel-wise loss","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74313737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-end weakly-supervised single-stage multiple 3D hand mesh reconstruction from a single RGB image
Pub Date: 2022-04-18  DOI: 10.2139/ssrn.4199294
Jinwei Ren, Jianke Zhu, Jialiang Zhang
In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single-hand reconstruction or solve this problem in a multi-stage way; the conventional two-stage pipeline first detects hand regions and then estimates the 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, we propose, for the first time, a concise but efficient single-stage pipeline for multi-hand reconstruction. Specifically, we design a multi-head auto-encoder structure in which each head network shares the same feature map and outputs the hand center, pose, and texture, respectively. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotations. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated from publicly available single-hand datasets. To further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single- and multi-hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M, and RHD demonstrate that our method outperforms state-of-the-art model-based methods in both the weakly-supervised and fully-supervised settings. The code and models are available at https://github.com/zijinxuxu/SMHR.
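A minimal sketch of the single-stage, multi-head idea described above (an assumption, not the released SMHR code): several lightweight heads share one backbone feature map and predict a hand-center heatmap plus per-location pose and texture parameters, so all hands are recovered in one pass without per-hand crops.

```python
# Minimal sketch (an assumption, not the released SMHR code) of a shared
# feature map feeding several heads that predict hand centers, pose, and
# texture parameters in a single stage.
import torch
import torch.nn as nn

class MultiHeadHandDecoder(nn.Module):
    def __init__(self, feat_ch=256, n_pose=48, n_tex=10):
        super().__init__()
        self.center_head = nn.Conv2d(feat_ch, 1, 1)         # per-pixel hand-center heatmap
        self.pose_head = nn.Conv2d(feat_ch, n_pose, 1)      # pose parameters at each location
        self.tex_head = nn.Conv2d(feat_ch, n_tex, 1)        # texture parameters at each location

    def forward(self, feat):                 # feat: (B, C, H, W) shared backbone feature map
        return (torch.sigmoid(self.center_head(feat)),
                self.pose_head(feat),
                self.tex_head(feat))

feat = torch.randn(1, 256, 32, 32)
centers, pose, tex = MultiHeadHandDecoder()(feat)
# Each detected center peak indexes the pose/texture predictions at the same
# location, so every hand in the image is decoded from one shared feature map.
print(centers.shape, pose.shape, tex.shape)   # (1,1,32,32) (1,48,32,32) (1,10,32,32)
```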
{"title":"End-to-end weakly-supervised single-stage multiple 3D hand mesh reconstruction from a single RGB image","authors":"Jinwei Ren, Jianke Zhu, Jialiang Zhang","doi":"10.2139/ssrn.4199294","DOIUrl":"https://doi.org/10.2139/ssrn.4199294","url":null,"abstract":"In this paper, we consider the challenging task of simultaneously locating and recovering multiple hands from a single 2D image. Previous studies either focus on single hand reconstruction or solve this problem in a multi-stage way. Moreover, the conventional two-stage pipeline firstly detects hand areas, and then estimates 3D hand pose from each cropped patch. To reduce the computational redundancy in preprocessing and feature extraction, for the first time, we propose a concise but efficient single-stage pipeline for multi-hand reconstruction. Specifically, we design a multi-head auto-encoder structure, where each head network shares the same feature map and outputs the hand center, pose and texture, respectively. Besides, we adopt a weakly-supervised scheme to alleviate the burden of expensive 3D real-world data annotations. To this end, we propose a series of losses optimized by a stage-wise training scheme, where a multi-hand dataset with 2D annotations is generated based on the publicly available single hand datasets. In order to further improve the accuracy of the weakly supervised model, we adopt several feature consistency constraints in both single and multiple hand settings. Specifically, the keypoints of each hand estimated from local features should be consistent with the re-projected points predicted from global features. Extensive experiments on public benchmarks including FreiHAND, HO3D, InterHand2.6M and RHD demonstrate that our method outperforms the state-of-the-art model-based methods in both weakly-supervised and fully-supervised manners. The code and models are available at {https://github.com/zijinxuxu/SMHR}.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73698711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video Captioning: a comparative review of where we are and which could be the route
Pub Date: 2022-04-12  DOI: 10.48550/arXiv.2204.05976
Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez
Video captioning is the process of describing the content of a sequence of images while capturing its semantic relationships and meanings. This task is arduous even for a single image, let alone for a video (or image sequence). The applications of video captioning are numerous and relevant, ranging from handling the large volume of recordings produced by video surveillance to assisting people with visual impairments, to mention a few. To analyze where our community's efforts to solve the video captioning task stand, as well as which route could be better to follow, this manuscript presents an extensive review of more than 105 papers from 2016 to 2021. As a result, the most-used datasets and metrics are identified, along with the main approaches and the best-performing ones. We compute a set of rankings based on several performance metrics to determine which method achieves the best results on the video captioning task. Finally, we draw some insights about the next steps and opportunity areas for improving work on this complex task.
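The ranking computation mentioned above can be illustrated with a small sketch; the method names and scores are hypothetical, and the aggregation rule (average rank across metrics) is an assumption about how such a comparison might be done.

```python
# Minimal sketch (illustrative assumption) of aggregating per-metric rankings
# into an overall ranking of captioning methods, as a survey might do.
methods = {                      # hypothetical scores: higher is better
    "MethodA": {"BLEU-4": 0.42, "METEOR": 0.29, "CIDEr": 0.51},
    "MethodB": {"BLEU-4": 0.45, "METEOR": 0.27, "CIDEr": 0.49},
    "MethodC": {"BLEU-4": 0.40, "METEOR": 0.30, "CIDEr": 0.55},
}

def average_rank(methods):
    metrics = next(iter(methods.values())).keys()
    ranks = {name: [] for name in methods}
    for metric in metrics:
        ordered = sorted(methods, key=lambda m: methods[m][metric], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            ranks[name].append(rank)
    # lower average rank = better overall standing across all metrics
    return sorted((sum(r) / len(r), name) for name, r in ranks.items())

for avg, name in average_rank(methods):
    print(f"{name}: average rank {avg:.2f}")
```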
{"title":"Video Captioning: a comparative review of where we are and which could be the route","authors":"Daniela Moctezuma, Tania Ram'irez-delReal, Guillermo Ruiz, Oth'on Gonz'alez-Ch'avez","doi":"10.48550/arXiv.2204.05976","DOIUrl":"https://doi.org/10.48550/arXiv.2204.05976","url":null,"abstract":"Video captioning is the process of describing the content of a sequence of images capturing its semantic relationships and meanings. Dealing with this task with a single image is arduous, not to mention how difficult it is for a video (or images sequence). The amount and relevance of the applications of video captioning are vast, mainly to deal with a significant amount of video recordings in video surveillance, or assisting people visually impaired, to mention a few. To analyze where the efforts of our community to solve the video captioning task are, as well as what route could be better to follow, this manuscript presents an extensive review of more than 105 papers for the period of 2016 to 2021. As a result, the most-used datasets and metrics are identified. Also, the main approaches used and the best ones. We compute a set of rankings based on several performance metrics to obtain, according to its performance, the best method with the best result on the video captioning task. Finally, some insights are concluded about which could be the next steps or opportunity areas to improve dealing with this complex task.","PeriodicalId":10549,"journal":{"name":"Comput. Vis. Image Underst.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86286089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}