
Latest publications: Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision

Tracking Small and Fast Moving Objects: A Benchmark
Zhewen Zhang, Fuliang Wu, Yuming Qiu, Jingdong Liang, Shuiwang Li
With more and more large-scale datasets available for training, visual tracking has made great progress in recent years. However, current research in the field mainly focuses on tracking generic objects. In this paper, we present TSFMO, a benchmark for Tracking Small and Fast Moving Objects. This benchmark aims particularly to encourage research on developing novel and accurate methods for this challenging task. TSFMO consists of 250 sequences with about 50k frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box. To the best of our knowledge, TSFMO is the first benchmark dedicated to tracking small and fast moving objects, especially those related to sports. To understand how existing methods perform and to provide a comparison point for future research on TSFMO, we extensively evaluate 20 state-of-the-art trackers on the benchmark. The evaluation results show that more effort is required to improve the tracking of small and fast moving objects. Moreover, to encourage future research, we propose a novel tracker, S-KeepTrack, which surpasses all 20 evaluated approaches. By releasing TSFMO, we expect to facilitate future research on and applications of tracking small and fast moving objects. TSFMO, the evaluation results, and S-KeepTrack are available at https://github.com/CodeOfGithub/S-KeepTrack.
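The benchmark evaluates trackers against per-frame bounding-box annotations. As a rough illustration of how such an evaluation is typically scored, the sketch below computes an IoU-based success curve for one sequence; `box_iou`, `success_curve`, and the `[x, y, w, h]` box convention are assumptions, and TSFMO's official protocol may differ. The area under such a curve is the usual single-number summary reported in success plots.

```python
import numpy as np

def box_iou(pred, gt):
    """IoU between two [x, y, w, h] boxes."""
    x1 = max(pred[0], gt[0])
    y1 = max(pred[1], gt[1])
    x2 = min(pred[0] + pred[2], gt[0] + gt[2])
    y2 = min(pred[1] + pred[3], gt[1] + gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = pred[2] * pred[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 21)):
    """Fraction of frames whose IoU exceeds each overlap threshold."""
    ious = np.array([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return np.array([(ious > t).mean() for t in thresholds])
```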
{"title":"Tracking Small and Fast Moving Objects: A Benchmark","authors":"Zhewen Zhang, Fuliang Wu, Yuming Qiu, Jingdong Liang, Shuiwang Li","doi":"10.48550/arXiv.2209.04284","DOIUrl":"https://doi.org/10.48550/arXiv.2209.04284","url":null,"abstract":"With more and more large-scale datasets available for training, visual tracking has made great progress in recent years. However, current research in the field mainly focuses on tracking generic objects. In this paper, we present TSFMO, a benchmark for textbf{T}racking textbf{S}mall and textbf{F}ast textbf{M}oving textbf{O}bjects. This benchmark aims to encourage research in developing novel and accurate methods for this challenging task particularly. TSFMO consists of 250 sequences with about 50k frames in total. Each frame in these sequences is carefully and manually annotated with a bounding box. To the best of our knowledge, TSFMO is the first benchmark dedicated to tracking small and fast moving objects, especially connected to sports. To understand how existing methods perform and to provide comparison for future research on TSFMO, we extensively evaluate 20 state-of-the-art trackers on the benchmark. The evaluation results exhibit that more effort are required to improve tracking small and fast moving objects. Moreover, to encourage future research, we proposed a novel tracker S-KeepTrack which surpasses all 20 evaluated approaches. By releasing TSFMO, we expect to facilitate future researches and applications of tracking small and fast moving objects. The TSFMO and evaluation results as well as S-KeepTrack are available at url{https://github.com/CodeOfGithub/S-KeepTrack}.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75253036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Few-shot Adaptive Object Detection with Cross-Domain CutMix
Yuzuru Nakamura, Yasunori Ishii, Yuki Maruyama, Takayoshi Yamashita
In object detection, data amount and cost are a trade-off, and collecting a large amount of data in a specific domain is labor intensive. Therefore, existing large-scale datasets are used for pre-training. However, conventional transfer learning and domain adaptation cannot bridge the domain gap when the target domain differs significantly from the source domain. We propose a data synthesis method that can solve the large-domain-gap problem. In this method, a part of the target image is pasted onto the source image, and the position of the pasted region is aligned by utilizing the information of the object bounding box. In addition, we introduce adversarial learning to discriminate whether regions are original or pasted. The proposed method trains on a large number of source images and a few target-domain images. It achieves higher accuracy than conventional methods in a very different domain setting, where RGB images are the source domain and thermal infrared images are the target domain. Similarly, the proposed method achieves higher accuracy when transferring from simulation images to real images.
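A minimal sketch of the paste-and-align idea described above, assuming NumPy image arrays and an `(x, y, w, h)` box; `cross_domain_paste` is a hypothetical helper rather than the authors' code, and the adversarial discriminator that distinguishes original from pasted regions is not shown.

```python
import numpy as np

def cross_domain_paste(source_img, target_img, source_box):
    """Paste a target-domain crop onto the source image at a source object's box.

    source_img, target_img: H x W x C uint8 arrays; source_box: (x, y, w, h).
    Assumes the target image is at least as large as the box.
    """
    x, y, w, h = source_box
    th, tw = target_img.shape[:2]
    # Pick a random target-domain region of the same size as the source box.
    tx = np.random.randint(0, max(1, tw - w))
    ty = np.random.randint(0, max(1, th - h))
    mixed = source_img.copy()
    mixed[y:y + h, x:x + w] = target_img[ty:ty + h, tx:tx + w]
    return mixed
```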
{"title":"Few-shot Adaptive Object Detection with Cross-Domain CutMix","authors":"Yuzuru Nakamura, Yasunori Ishii, Yuki Maruyama, Takayoshi Yamashita","doi":"10.48550/arXiv.2208.14586","DOIUrl":"https://doi.org/10.48550/arXiv.2208.14586","url":null,"abstract":"In object detection, data amount and cost are a trade-off, and collecting a large amount of data in a specific domain is labor intensive. Therefore, existing large-scale datasets are used for pre-training. However, conventional transfer learning and domain adaptation cannot bridge the domain gap when the target domain differs significantly from the source domain. We propose a data synthesis method that can solve the large domain gap problem. In this method, a part of the target image is pasted onto the source image, and the position of the pasted region is aligned by utilizing the information of the object bounding box. In addition, we introduce adversarial learning to discriminate whether the original or the pasted regions. The proposed method trains on a large number of source images and a few target domain images. The proposed method achieves higher accuracy than conventional methods in a very different domain problem setting, where RGB images are the source domain, and thermal infrared images are the target domain. Similarly, the proposed method achieves higher accuracy in the cases of simulation images to real images.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79770867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization
Yujiao Shi, Xin Yu, Shanhe Wang, Hongdong Li
This paper tackles the problem of Cross-view Video-based camera Localization (CVL). The task is to localize a query camera by leveraging information from its past observations, i.e., a continuous sequence of images observed at previous time stamps, and matching them to a large overhead-view satellite image. The critical challenge of this task is to learn a powerful global feature descriptor for the sequential ground-view images while considering their domain alignment with reference satellite images. For this purpose, we introduce CVLNet, which first projects the sequential ground-view images into an overhead view by exploring the ground-and-overhead geometric correspondences and then leverages the photo consistency among the projected images to form a global representation. In this way, the cross-view domain differences are bridged. Since the reference satellite images are usually pre-cropped and regularly sampled, there is always a misalignment between the query camera location and its matching satellite image center. Motivated by this, we propose estimating the query camera's relative displacement to a satellite image before similarity matching. In this displacement estimation process, we also consider the uncertainty of the camera location. For example, a camera is unlikely to be on top of trees. To evaluate the performance of the proposed method, we collect satellite images from Google Maps for the KITTI dataset and construct a new cross-view video-based localization benchmark dataset, KITTI-CVL. Extensive experiments have demonstrated the effectiveness of video-based localization over single image-based localization and the superiority of each proposed module over other alternatives.
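As a toy illustration of the displacement-estimation step, the sketch below slides a projected ground-view feature map over a satellite feature map and picks the offset with the highest correlation; CVLNet additionally models the uncertainty of the camera location, which this hypothetical `estimate_displacement` ignores.

```python
import torch
import torch.nn.functional as F

def estimate_displacement(query_feat, sat_feat):
    """Return the offset (dy, dx) at which the query feature map best
    correlates with the satellite feature map.

    query_feat: (C, h, w) tensor; sat_feat: (C, H, W) with H >= h, W >= w.
    """
    # Treat the query map as a single correlation kernel over the satellite map.
    corr = F.conv2d(sat_feat.unsqueeze(0), query_feat.unsqueeze(0))  # (1, 1, H-h+1, W-w+1)
    flat_idx = corr.view(-1).argmax()
    dy, dx = divmod(flat_idx.item(), corr.shape[-1])
    return dy, dx, corr
```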
{"title":"CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization","authors":"Yujiao Shi, Xin Yu, Shanhe Wang, Hongdong Li","doi":"10.48550/arXiv.2208.03660","DOIUrl":"https://doi.org/10.48550/arXiv.2208.03660","url":null,"abstract":"This paper tackles the problem of Cross-view Video-based camera Localization (CVL). The task is to localize a query camera by leveraging information from its past observations, i.e., a continuous sequence of images observed at previous time stamps, and matching them to a large overhead-view satellite image. The critical challenge of this task is to learn a powerful global feature descriptor for the sequential ground-view images while considering its domain alignment with reference satellite images. For this purpose, we introduce CVLNet, which first projects the sequential ground-view images into an overhead view by exploring the ground-and-overhead geometric correspondences and then leverages the photo consistency among the projected images to form a global representation. In this way, the cross-view domain differences are bridged. Since the reference satellite images are usually pre-cropped and regularly sampled, there is always a misalignment between the query camera location and its matching satellite image center. Motivated by this, we propose estimating the query camera's relative displacement to a satellite image before similarity matching. In this displacement estimation process, we also consider the uncertainty of the camera location. For example, a camera is unlikely to be on top of trees. To evaluate the performance of the proposed method, we collect satellite images from Google Map for the KITTI dataset and construct a new cross-view video-based localization benchmark dataset, KITTI-CVL. Extensive experiments have demonstrated the effectiveness of video-based localization over single image-based localization and the superiority of each proposed module over other alternatives.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75838313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Explaining Deep Neural Networks for Point Clouds using Gradient-based Visualisations
Jawad Tayyub, M. Sarmad, Nicolas Schonborn
Explaining decisions made by deep neural networks is a rapidly advancing research topic. In recent years, several approaches have attempted to provide visual explanations of decisions made by neural networks designed for structured 2D image input data. In this paper, we propose a novel approach to generate coarse visual explanations of networks designed to classify unstructured 3D data, namely point clouds. Our method uses gradients flowing back to the final feature map layers and maps these values as contributions of the corresponding points in the input point cloud. Due to dimensionality disagreement and lack of spatial consistency between input points and final feature maps, our approach combines gradients with point dropping to compute explanations of different parts of the point cloud iteratively. The generality of our approach is tested on various point cloud classification networks, including the 'single object' networks PointNet, PointNet++, and DGCNN, and the 'scene' network VoteNet. Our method generates symmetric explanation maps that highlight important regions and provide insight into the decision-making process of network architectures. We perform an exhaustive evaluation of the trust and interpretability of our explanation method against comparative approaches using quantitative, qualitative, and human studies. All our code is implemented in PyTorch and will be made publicly available.
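A hedged sketch of the gradient-based attribution idea for point clouds, in the spirit of Grad-CAM: gradients of the target logit with respect to per-point features weight each point's contribution. `model.backbone` and `model.classifier` are assumed handles into a PointNet-like network, and the iterative point-dropping refinement is omitted.

```python
import torch

def point_saliency(model, points, target_class):
    """Gradient-weighted per-point contributions for one classification.

    points: (1, N, 3) tensor; returns a (1, N) saliency map in [0, 1].
    """
    feats = model.backbone(points)          # (1, N, C) per-point features (assumed API)
    feats.retain_grad()                     # keep gradients on this non-leaf tensor
    logits = model.classifier(feats)        # (1, num_classes) (assumed API)
    logits[0, target_class].backward()
    # Weight each point's features by their gradients and sum over channels.
    contrib = (feats.grad * feats).sum(dim=-1).relu()   # (1, N)
    return contrib / (contrib.max() + 1e-8)
```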
{"title":"Explaining Deep Neural Networks for Point Clouds using Gradient-based Visualisations","authors":"Jawad Tayyub, M. Sarmad, Nicolas Schonborn","doi":"10.48550/arXiv.2207.12984","DOIUrl":"https://doi.org/10.48550/arXiv.2207.12984","url":null,"abstract":"Explaining decisions made by deep neural networks is a rapidly advancing research topic. In recent years, several approaches have attempted to provide visual explanations of decisions made by neural networks designed for structured 2D image input data. In this paper, we propose a novel approach to generate coarse visual explanations of networks designed to classify unstructured 3D data, namely point clouds. Our method uses gradients flowing back to the final feature map layers and maps these values as contributions of the corresponding points in the input point cloud. Due to dimensionality disagreement and lack of spatial consistency between input points and final feature maps, our approach combines gradients with points dropping to compute explanations of different parts of the point cloud iteratively. The generality of our approach is tested on various point cloud classification networks, including 'single object' networks PointNet, PointNet++, DGCNN, and a 'scene' network VoteNet. Our method generates symmetric explanation maps that highlight important regions and provide insight into the decision-making process of network architectures. We perform an exhaustive evaluation of trust and interpretability of our explanation method against comparative approaches using quantitative, quantitative and human studies. All our code is implemented in PyTorch and will be made publicly available.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90711816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Self-Distilled Vision Transformer for Domain Generalization
M. Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, F. Khan
In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance; however, almost all of them build on convolutional neural networks (CNNs). There is little to no progress on studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks that are often built on the i.i.d. assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios, and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined self-distillation for ViTs. It reduces overfitting to source domains by easing the learning of the input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code along with pre-trained models is publicly available at: https://github.com/maryam089/SDViT.
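One plausible reading of the self-distillation objective is a KL term that pushes intermediate-block predictions toward the softened (non-zero-entropy) output of the final block, sketched below; the temperature `T`, the averaging, and the function name are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(final_logits, block_logits_list, T=3.0):
    """KL divergence from softened final-block predictions to each
    intermediate block's predictions, averaged over blocks.
    """
    soft_targets = F.softmax(final_logits.detach() / T, dim=-1)  # non-zero-entropy targets
    loss = 0.0
    for block_logits in block_logits_list:
        log_probs = F.log_softmax(block_logits / T, dim=-1)
        loss = loss + F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
    return loss / len(block_logits_list)
```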
{"title":"Self-Distilled Vision Transformer for Domain Generalization","authors":"M. Sultana, Muzammal Naseer, Muhammad Haris Khan, Salman Khan, F. Khan","doi":"10.48550/arXiv.2207.12392","DOIUrl":"https://doi.org/10.48550/arXiv.2207.12392","url":null,"abstract":"In the recent past, several domain generalization (DG) methods have been proposed, showing encouraging performance, however, almost all of them build on convolutional neural networks (CNNs). There is little to no progress on studying the DG performance of vision transformers (ViTs), which are challenging the supremacy of CNNs on standard benchmarks, often built on i.i.d assumption. This renders the real-world deployment of ViTs doubtful. In this paper, we attempt to explore ViTs towards addressing the DG problem. Similar to CNNs, ViTs also struggle in out-of-distribution scenarios and the main culprit is overfitting to source domains. Inspired by the modular architecture of ViTs, we propose a simple DG approach for ViTs, coined as self-distillation for ViTs. It reduces the overfitting of source domains by easing the learning of input-output mapping problem through curating non-zero entropy supervisory signals for intermediate transformer blocks. Further, it does not introduce any new parameters and can be seamlessly plugged into the modular composition of different ViTs. We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones in five challenging datasets. Moreover, we report favorable performance against recent state-of-the-art DG methods. Our code along with pre-trained models are publicly available at: https://github.com/maryam089/SDViT.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73853866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Is an Object-Centric Video Representation Beneficial for Transfer?
Chuhan Zhang, Ankush Gupta, Andrew Zisserman
The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification.
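The trajectory contrast loss is not specified in the abstract; a generic InfoNCE-style contrast between object summary vectors and their matching trajectory features might look like the sketch below, which is an assumption rather than the paper's definition.

```python
import torch
import torch.nn.functional as F

def trajectory_contrast_loss(summary_vecs, traj_feats, temperature=0.07):
    """Contrastive loss pairing the i-th summary vector with the i-th
    trajectory feature; all other pairings act as negatives.

    summary_vecs, traj_feats: (N, D) tensors.
    """
    s = F.normalize(summary_vecs, dim=-1)
    t = F.normalize(traj_feats, dim=-1)
    logits = s @ t.T / temperature              # (N, N) similarity matrix
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)
```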
{"title":"Is an Object-Centric Video Representation Beneficial for Transfer?","authors":"Chuhan Zhang, Ankush Gupta, Andrew Zisserman","doi":"10.48550/arXiv.2207.10075","DOIUrl":"https://doi.org/10.48550/arXiv.2207.10075","url":null,"abstract":"The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84771454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
DOLPHINS: Dataset for Collaborative Perception enabled Harmonious and Interconnected Self-driving
Ruiqing Mao, Jingyu Guo, Yukuan Jia, Yuxuan Sun, Sheng Zhou, Z. Niu
{"title":"DOLPHINS: Dataset for Collaborative Perception enabled Harmonious and Interconnected Self-driving","authors":"Ruiqing Mao, Jingyu Guo, Yukuan Jia, Yuxuan Sun, Sheng Zhou, Z. Niu","doi":"10.1007/978-3-031-26348-4_29","DOIUrl":"https://doi.org/10.1007/978-3-031-26348-4_29","url":null,"abstract":"","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91232292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Towards Real-time High-Definition Image Snow Removal: Efficient Pyramid Network with Asymmetrical Encoder-decoder Architecture
Tian Ye, Sixiang Chen, Yun Liu, Y. Ye, Erkang Chen
In winter scenes, the degradation of images taken under snow can be quite complex, and the spatial distribution of snowy degradation varies from image to image. Recent methods adopt deep neural networks to directly recover clean scenes from snowy images. However, due to the paradox caused by the variation of complex snowy degradation, achieving reliable High-Definition image desnowing performance in real time is a considerable challenge. We develop a novel Efficient Pyramid Network with an asymmetrical encoder-decoder architecture for real-time HD image desnowing. The general idea of our proposed network is to fully utilize the multi-scale feature flow and implicitly mine clean cues from features. Compared with previous state-of-the-art desnowing methods, our approach achieves a better complexity-performance trade-off and effectively handles the processing difficulties of HD and Ultra-HD images. Extensive experiments on three large-scale image desnowing datasets demonstrate that our method surpasses all state-of-the-art approaches by a large margin both quantitatively and qualitatively, boosting the PSNR metric from 31.76 dB to 34.10 dB on the CSD test dataset and from 28.29 dB to 30.87 dB on the SRRS test dataset.
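To make the encoder-decoder asymmetry concrete, here is a toy network with a multi-scale (pyramid-like) encoder and a single lightweight decoder stage that predicts a residual snow layer; it only illustrates the asymmetry idea under the assumption that input sides are divisible by 8, and it is not the authors' architecture.

```python
import torch
import torch.nn as nn

class TinyAsymmetricPyramid(nn.Module):
    """Toy desnowing net: heavy three-level encoder, light one-shot decoder."""

    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU())
        self.enc3 = nn.Sequential(nn.Conv2d(ch * 2, ch * 4, 3, stride=2, padding=1), nn.ReLU())
        # Single lightweight decoder stage: upsample straight back to full size.
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(ch * 4, 3, 3, padding=1),
        )

    def forward(self, x):
        f = self.enc3(self.enc2(self.enc1(x)))
        return x - self.dec(f)      # subtract the predicted snow residual
```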
{"title":"Towards Real-time High-Definition Image Snow Removal: Efficient Pyramid Network with Asymmetrical Encoder-decoder Architecture","authors":"Tian Ye, Sixiang Chen, Yun Liu, Y. Ye, Erkang Chen","doi":"10.48550/arXiv.2207.05605","DOIUrl":"https://doi.org/10.48550/arXiv.2207.05605","url":null,"abstract":"In winter scenes, the degradation of images taken under snow can be pretty complex, where the spatial distribution of snowy degradation is varied from image to image. Recent methods adopt deep neural networks to directly recover clean scenes from snowy images. However, due to the paradox caused by the variation of complex snowy degradation, achieving reliable High-Definition image desnowing performance in real time is a considerable challenge. We develop a novel Efficient Pyramid Network with asymmetrical encoder-decoder architecture for real-time HD image desnowing. The general idea of our proposed network is to utilize the multi-scale feature flow fully and implicitly mine clean cues from features. Compared with previous state-of-the-art desnowing methods, our approach achieves a better complexity-performance trade-off and effectively handles the processing difficulties of HD and Ultra-HD images. The extensive experiments on three large-scale image desnowing datasets demonstrate that our method surpasses all state-of-the-art approaches by a large margin both quantitatively and qualitatively, boosting the PSNR metric from 31.76 dB to 34.10 dB on the CSD test dataset and from 28.29 dB to 30.87 dB on the SRRS test dataset.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91095977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Cross-Attention Transformer for Video Interpolation
Hannah Kim, Shuzhi Yu, Shuaihang Yuan, Carlo Tomasi
We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation, which aims to interpolate an intermediate frame given two consecutive image frames around it. We first present a novel vision transformer module, named Cross Similarity (CS), to globally aggregate input image features with similar appearance as those of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module to allow the network to focus on CS features from one frame over those of the other. TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods while being computationally efficient in terms of inference time on Vimeo90k, UCF101, and SNU-FILM benchmarks.
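The Cross Similarity module aggregates input-frame features according to their similarity to the predicted frame, which amounts to cross-attention with the prediction as queries; a generic sketch is below, with the scaling factor and the choice of values as assumptions rather than TAIN's exact design.

```python
import torch
import torch.nn.functional as F

def cross_similarity(pred_feat, input_feat):
    """Cross-attention that uses predicted-frame features as queries and
    input-frame features as keys/values.

    pred_feat, input_feat: (B, C, H, W) tensors; returns (B, C, H, W).
    """
    B, C, H, W = pred_feat.shape
    q = pred_feat.flatten(2).transpose(1, 2)    # (B, HW, C) queries
    k = input_feat.flatten(2).transpose(1, 2)   # (B, HW, C) keys
    v = k                                       # values share the keys here
    attn = F.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, HW, HW)
    out = attn @ v                              # (B, HW, C)
    return out.transpose(1, 2).reshape(B, C, H, W)
```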
{"title":"Cross-Attention Transformer for Video Interpolation","authors":"Hannah Kim, Shuzhi Yu, Shuaihang Yuan, Carlo Tomasi","doi":"10.48550/arXiv.2207.04132","DOIUrl":"https://doi.org/10.48550/arXiv.2207.04132","url":null,"abstract":"We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation, which aims to interpolate an intermediate frame given two consecutive image frames around it. We first present a novel vision transformer module, named Cross Similarity (CS), to globally aggregate input image features with similar appearance as those of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module to allow the network to focus on CS features from one frame over those of the other. TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods while being computationally efficient in terms of inference time on Vimeo90k, UCF101, and SNU-FILM benchmarks.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88659260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Network Pruning via Feature Shift Minimization
Y. Duan, Xiaofang Hu, Yue Zhou, Peng He, Qi Liu, Shukai Duan
Channel pruning is widely used to reduce the complexity of deep network models. Recent pruning methods usually identify which parts of the network to discard by proposing a channel importance criterion. However, recent studies have shown that these criteria do not work well in all conditions. In this paper, we propose a novel Feature Shift Minimization (FSM) method to compress CNN models, which evaluates the feature shift by converging the information of both features and filters. Specifically, we first investigate the compression efficiency with some prevalent methods in different layer-depths and then propose the feature shift concept. Then, we introduce an approximation method to estimate the magnitude of the feature shift, since it is difficult to compute it directly. Besides, we present a distribution-optimization algorithm to compensate for the accuracy loss and improve the network compression efficiency. The proposed method yields state-of-the-art performance on various benchmark networks and datasets, verified by extensive experiments. Our codes are available at: https://github.com/lscgx/FSM.
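Channel pruning of this kind ultimately scores output channels and keeps the top-ranked ones; the sketch below shows that generic step, with `scores` standing in for the paper's feature-shift criterion, whose exact computation (combining feature and filter statistics with the approximation and distribution-optimization steps) is not reproduced here.

```python
import torch

def prune_channels(conv_weight, scores, prune_ratio=0.3):
    """Keep the channels with the highest importance scores, drop the rest.

    conv_weight: (out_channels, in_channels, k, k) tensor;
    scores: (out_channels,) importance values (placeholder for FSM scores).
    """
    n_keep = max(1, int(conv_weight.size(0) * (1 - prune_ratio)))
    keep_idx = torch.argsort(scores, descending=True)[:n_keep]
    return conv_weight[keep_idx], keep_idx
```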
{"title":"Network Pruning via Feature Shift Minimization","authors":"Y. Duan, Xiaofang Hu, Yue Zhou, Peng He, Qi Liu, Shukai Duan","doi":"10.48550/arXiv.2207.02632","DOIUrl":"https://doi.org/10.48550/arXiv.2207.02632","url":null,"abstract":"Channel pruning is widely used to reduce the complexity of deep network models. Recent pruning methods usually identify which parts of the network to discard by proposing a channel importance criterion. However, recent studies have shown that these criteria do not work well in all conditions. In this paper, we propose a novel Feature Shift Minimization (FSM) method to compress CNN models, which evaluates the feature shift by converging the information of both features and filters. Specifically, we first investigate the compression efficiency with some prevalent methods in different layer-depths and then propose the feature shift concept. Then, we introduce an approximation method to estimate the magnitude of the feature shift, since it is difficult to compute it directly. Besides, we present a distribution-optimization algorithm to compensate for the accuracy loss and improve the network compression efficiency. The proposed method yields state-of-the-art performance on various benchmark networks and datasets, verified by extensive experiments. Our codes are available at: https://github.com/lscgx/FSM.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83081464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1