
Latest publications in Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision

Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition
Lei Wang, Piotr Koniusz
We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping which jointly models each smooth path between the query and support frames to achieve simultaneously the best alignment in the temporal and simulated camera viewpoint spaces for end-to-end learning under the limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone. We also include a setting with a transformer. Finally, we propose a similarity-based loss which encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We show state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.
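Below is a minimal NumPy sketch of a joint temporal-viewpoint alignment in the spirit of the DTW variant described above: a dynamic program over query time, support time, and a set of simulated camera viewpoints, where each step may also move to a neighbouring viewpoint. The frame cost, the viewpoint transition rule and all names are illustrative assumptions, not the authors' JEANIE implementation.

```python
# Toy joint temporal-viewpoint alignment (JEANIE-style), illustrative only.
import numpy as np

def frame_cost(q_feat, s_feat):
    # distance between one query frame embedding and one support frame embedding
    return np.linalg.norm(q_feat - s_feat)

def jeanie_like_alignment(query, support_views):
    """query: (T_q, D) frame embeddings; support_views: (V, T_s, D) embeddings
    of the support sequence rendered under V simulated viewpoints."""
    V, T_s, _ = support_views.shape
    T_q = query.shape[0]
    acc = np.full((T_q + 1, T_s + 1, V), np.inf)
    acc[0, 0, :] = 0.0
    for i in range(1, T_q + 1):
        for j in range(1, T_s + 1):
            for v in range(V):
                c = frame_cost(query[i - 1], support_views[v, j - 1])
                # temporal steps as in DTW, plus a smooth step to a neighbouring viewpoint
                prev = min(
                    acc[i - 1, j, v], acc[i, j - 1, v], acc[i - 1, j - 1, v],
                    acc[i - 1, j - 1, max(v - 1, 0)],
                    acc[i - 1, j - 1, min(v + 1, V - 1)],
                )
                acc[i, j, v] = c + prev
    return acc[T_q, T_s, :].min()   # best joint temporal-viewpoint alignment cost

# usage: distance between one query sequence and one support example
rng = np.random.default_rng(0)
d = jeanie_like_alignment(rng.normal(size=(20, 64)), rng.normal(size=(5, 24, 64)))
```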
{"title":"Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition","authors":"Lei Wang, Piotr Koniusz","doi":"10.48550/arXiv.2210.16820","DOIUrl":"https://doi.org/10.48550/arXiv.2210.16820","url":null,"abstract":"We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping which jointly models each smooth path between the query and support frames to achieve simultaneously the best alignment in the temporal and simulated camera viewpoint spaces for end-to-end learning under the limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone. We also include a setting with a transformer. Finally, we propose a similarity-based loss which encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We show state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86980335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Self-Supervised Learning with Multi-View Rendering for 3D Point Cloud Analysis
Bach Tran, Binh-Son Hua, A. Tran, Minh Hoai
Recently, great progress has been made in 3D deep learning with the emergence of deep neural networks specifically designed for 3D point clouds. These networks are often trained from scratch or from pre-trained models learned purely from point cloud data. Inspired by the success of deep learning in the image domain, we devise a novel pre-training technique for better model initialization by utilizing the multi-view rendering of the 3D data. Our pre-training is self-supervised by a local pixel/point level correspondence loss computed from perspective projection and a global image/point cloud level loss based on knowledge distillation, thus effectively improving upon popular point cloud networks, including PointNet, DGCNN and SR-UNet. These improved models outperform existing state-of-the-art methods on various datasets and downstream tasks. We also analyze the benefits of synthetic and real data for pre-training, and observe that pre-training on synthetic data is also useful for high-level downstream tasks. Code and pre-trained models are available at https://github.com/VinAIResearch/selfsup_pcd.
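As a rough illustration of the local pixel/point correspondence loss mentioned above, the PyTorch sketch below projects 3D points into a rendered view with a pinhole camera, samples the image feature map at the projected pixels, and pulls point features toward the sampled pixel features. The intrinsics, shapes and cosine loss are assumptions for illustration; the authors' full pre-training pipeline is not reproduced here.

```python
# Toy pixel/point correspondence loss via perspective projection, illustrative only.
import torch
import torch.nn.functional as F

def project_points(points, K):
    """points: (N, 3) in camera coordinates; K: (3, 3) intrinsics -> (N, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)

def pixel_point_correspondence_loss(point_feats, img_feats, points, K, img_size):
    """point_feats: (N, C); img_feats: (1, C, H, W); points: (N, 3)."""
    H, W = img_size
    uv = project_points(points, K)
    # normalise pixel coordinates to [-1, 1] for grid_sample
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1, uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)
    sampled = F.grid_sample(img_feats, grid, align_corners=True)  # (1, C, 1, N)
    sampled = sampled.squeeze(0).squeeze(1).T                     # (N, C)
    return 1 - F.cosine_similarity(point_feats, sampled, dim=-1).mean()

# usage with made-up shapes and intrinsics
loss = pixel_point_correspondence_loss(
    torch.randn(500, 64), torch.randn(1, 64, 120, 160),
    torch.rand(500, 3) + torch.tensor([0.0, 0.0, 1.0]),   # points in front of the camera
    torch.tensor([[100.0, 0.0, 80.0], [0.0, 100.0, 60.0], [0.0, 0.0, 1.0]]), (120, 160))
```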
{"title":"Self-Supervised Learning with Multi-View Rendering for 3D Point Cloud Analysis","authors":"Bach Tran, Binh-Son Hua, A. Tran, Minh Hoai","doi":"10.48550/arXiv.2210.15904","DOIUrl":"https://doi.org/10.48550/arXiv.2210.15904","url":null,"abstract":"Recently, great progress has been made in 3D deep learning with the emergence of deep neural networks specifically designed for 3D point clouds. These networks are often trained from scratch or from pre-trained models learned purely from point cloud data. Inspired by the success of deep learning in the image domain, we devise a novel pre-training technique for better model initialization by utilizing the multi-view rendering of the 3D data. Our pre-training is self-supervised by a local pixel/point level correspondence loss computed from perspective projection and a global image/point cloud level loss based on knowledge distillation, thus effectively improving upon popular point cloud networks, including PointNet, DGCNN and SR-UNet. These improved models outperform existing state-of-the-art methods on various datasets and downstream tasks. We also analyze the benefits of synthetic and real data for pre-training, and observe that pre-training on synthetic data is also useful for high-level downstream tasks. Code and pre-trained models are available at https://github.com/VinAIResearch/selfsup_pcd.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77939088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Complex Handwriting Trajectory Recovery: Evaluation Metrics and Algorithm
Zhounan Chen, Daihui Yang, Jinglin Liang, Xinwu Liu, Yuyi Wang, Zhenghua Peng, Shuangping Huang
Many important tasks, such as forensic signature verification and calligraphy synthesis, rely on handwriting trajectory recovery, for which even an appropriate evaluation metric is still missing. Indeed, existing metrics focus only on writing order and overlook the fidelity of glyphs. Taking both facets into account, we propose two new metrics: the adaptive intersection on union (AIoU), which eliminates the influence of varying stroke widths, and the length-independent dynamic time warping (LDTW), which solves the trajectory-point alignment problem. We then propose a novel handwriting trajectory recovery model named the Parsing-and-tracing ENcoder-decoder Network (PEN-Net), in particular for characters with both complex glyphs and long trajectories, which is considered very challenging. In PEN-Net, a carefully designed double-stream parsing encoder parses the glyph structure, and a global tracing decoder overcomes the memory difficulty of long-trajectory prediction. Our experiments demonstrate that the two new metrics, AIoU and LDTW, can truly assess the quality of handwriting trajectory recovery, and that the proposed PEN-Net exhibits satisfactory performance on various complex-glyph languages including Chinese, Japanese and Indic.
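To make the alignment idea concrete, here is a rough NumPy sketch of a length-normalised DTW in the spirit of LDTW: standard DTW over trajectory points, with the accumulated cost divided by the number of matched pairs so that longer trajectories are not penalised simply for their length. This is an assumed formulation for illustration, not the paper's exact metric (AIoU is not shown).

```python
# Toy length-normalised DTW between a predicted and a ground-truth trajectory.
import numpy as np

def ldtw_like(pred, gt):
    """pred: (N, 2), gt: (M, 2) trajectory points -> length-independent cost."""
    N, M = len(pred), len(gt)
    cost = np.full((N + 1, M + 1), np.inf)
    steps = np.zeros((N + 1, M + 1))
    cost[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = np.linalg.norm(pred[i - 1] - gt[j - 1])
            choices = [cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]]
            k = int(np.argmin(choices))
            cost[i, j] = d + choices[k]
            # count how many point pairs lie on the chosen warping path
            steps[i, j] = 1 + [steps[i - 1, j], steps[i, j - 1], steps[i - 1, j - 1]][k]
    return cost[N, M] / steps[N, M]

score = ldtw_like(np.random.rand(120, 2), np.random.rand(80, 2))
```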
{"title":"Complex Handwriting Trajectory Recovery: Evaluation Metrics and Algorithm","authors":"Zhounan Chen, Daihui Yang, Jinglin Liang, Xinwu Liu, Yuyi Wang, Zhenghua Peng, Shuangping Huang","doi":"10.48550/arXiv.2210.15879","DOIUrl":"https://doi.org/10.48550/arXiv.2210.15879","url":null,"abstract":"Many important tasks such as forensic signature verification, calligraphy synthesis, etc, rely on handwriting trajectory recovery of which, however, even an appropriate evaluation metric is still missing. Indeed, existing metrics only focus on the writing orders but overlook the fidelity of glyphs. Taking both facets into account, we come up with two new metrics, the adaptive intersection on union (AIoU) which eliminates the influence of various stroke widths, and the length-independent dynamic time warping (LDTW) which solves the trajectory-point alignment problem. After that, we then propose a novel handwriting trajectory recovery model named Parsing-and-tracing ENcoder-decoder Network (PEN-Net), in particular for characters with both complex glyph and long trajectory, which was believed very challenging. In the PEN-Net, a carefully designed double-stream parsing encoder parses the glyph structure, and a global tracing decoder overcomes the memory difficulty of long trajectory prediction. Our experiments demonstrate that the two new metrics AIoU and LDTW together can truly assess the quality of handwriting trajectory recovery and the proposed PEN-Net exhibits satisfactory performance in various complex-glyph languages including Chinese, Japanese and Indic.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74883232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Spatio-channel Attention Blocks for Cross-modal Crowd Counting
Youjia Zhang, Soyun Choi, Sungeun Hong
Crowd counting research has made significant advancements in real-world applications, but it remains a formidable challenge in cross-modal settings. Most existing methods rely solely on the optical features of RGB images, ignoring the feasibility of other modalities such as thermal and depth images. The inherently significant differences between modalities and the diversity of design choices for model architectures make cross-modal crowd counting more challenging. In this paper, we propose Cross-modal Spatio-Channel Attention (CSCA) blocks, which can be easily integrated into any modality-specific architecture. The CSCA blocks first spatially capture global functional correlations among multiple modalities, with little overhead, through spatial-wise cross-modal attention. Cross-modal features with spatial attention are subsequently refined through adaptive channel-wise feature aggregation. In our experiments, the proposed block consistently shows significant performance improvements across various backbone networks, resulting in state-of-the-art results in RGB-T and RGB-D crowd counting.
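The sketch below gives a condensed PyTorch reading of such a block: spatial cross-attention between an RGB feature map and an auxiliary (e.g. thermal) feature map, followed by an adaptive channel-wise re-weighting of the fused features. Layer sizes, the residual fusion and the gating design are illustrative assumptions rather than the CSCA architecture itself.

```python
# Toy cross-modal spatio-channel attention block, illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioChannelBlock(nn.Module):
    def __init__(self, channels, reduced=64):
        super().__init__()
        self.q = nn.Conv2d(channels, reduced, 1)
        self.k = nn.Conv2d(channels, reduced, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, rgb, aux):
        B, C, H, W = rgb.shape
        q = self.q(rgb).flatten(2).transpose(1, 2)      # (B, HW, r)
        k = self.k(aux).flatten(2)                      # (B, r, HW)
        v = self.v(aux).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = F.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)   # spatial cross-modal attention
        cross = (attn @ v).transpose(1, 2).view(B, C, H, W)
        fused = rgb + cross
        return fused * self.channel_gate(fused)         # adaptive channel-wise aggregation

# usage: fuse RGB and thermal feature maps of the same shape
block = SpatioChannelBlock(128)
out = block(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32))
```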
{"title":"Spatio-channel Attention Blocks for Cross-modal Crowd Counting","authors":"Youjia Zhang, Soyun Choi, Sungeun Hong","doi":"10.48550/arXiv.2210.10392","DOIUrl":"https://doi.org/10.48550/arXiv.2210.10392","url":null,"abstract":"Crowd counting research has made significant advancements in real-world applications, but it remains a formidable challenge in cross-modal settings. Most existing methods rely solely on the optical features of RGB images, ignoring the feasibility of other modalities such as thermal and depth images. The inherently significant differences between the different modalities and the diversity of design choices for model architectures make cross-modal crowd counting more challenging. In this paper, we propose Cross-modal Spatio-Channel Attention (CSCA) blocks, which can be easily integrated into any modality-specific architecture. The CSCA blocks first spatially capture global functional correlations among multi-modality with less overhead through spatial-wise cross-modal attention. Cross-modal features with spatial attention are subsequently refined through adaptive channel-wise feature aggregation. In our experiments, the proposed block consistently shows significant performance improvement across various backbone networks, resulting in state-of-the-art results in RGB-T and RGB-D crowd counting.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72643833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Lightweight Alpha Matting Network Using Distillation-Based Channel Pruning
Donggeun Yoon, Jinsun Park, Donghyeon Cho
Recently, alpha matting has received a lot of attention because of its usefulness in mobile applications such as selfies. Therefore, there has been a demand for a lightweight alpha matting model due to the limited computational resources of commercial portable devices. To this end, we suggest a distillation-based channel pruning method for the alpha matting networks. In the pruning step, we remove channels of a student network having fewer impacts on mimicking the knowledge of a teacher network. Then, the pruned lightweight student network is trained by the same distillation loss. A lightweight alpha matting model from the proposed method outperforms existing lightweight methods. To show superiority of our algorithm, we provide various quantitative and qualitative experiments with in-depth analyses. Furthermore, we demonstrate the versatility of the proposed distillation-based channel pruning method by applying it to semantic segmentation.
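A simplified PyTorch sketch of the pruning criterion described above: per-channel gates are attached to a student feature map, a feature-mimicking distillation loss is computed against the teacher, and channels whose gates receive the smallest gradient magnitude (i.e. the least impact on mimicking the teacher) are marked for removal. The gating scheme and the MSE loss are assumptions for illustration, not the paper's exact procedure.

```python
# Toy distillation-based channel ranking for pruning, illustrative only.
import torch
import torch.nn.functional as F

def rank_channels_by_distillation(student_feat, teacher_feat, prune_ratio=0.25):
    """student_feat, teacher_feat: (B, C, H, W) feature maps with matching shapes."""
    C = student_feat.shape[1]
    gates = torch.ones(C, requires_grad=True)
    gated = student_feat * gates.view(1, C, 1, 1)
    loss = F.mse_loss(gated, teacher_feat)        # feature-mimicking distillation loss
    loss.backward()
    importance = gates.grad.abs()                 # small gradient => little impact on mimicking
    n_prune = int(C * prune_ratio)
    return importance.argsort()[:n_prune]         # indices of channels to prune

prune_idx = rank_channels_by_distillation(torch.randn(4, 64, 32, 32), torch.randn(4, 64, 32, 32))
```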
{"title":"Lightweight Alpha Matting Network Using Distillation-Based Channel Pruning","authors":"Donggeun Yoon, Jinsun Park, Donghyeon Cho","doi":"10.48550/arXiv.2210.07760","DOIUrl":"https://doi.org/10.48550/arXiv.2210.07760","url":null,"abstract":"Recently, alpha matting has received a lot of attention because of its usefulness in mobile applications such as selfies. Therefore, there has been a demand for a lightweight alpha matting model due to the limited computational resources of commercial portable devices. To this end, we suggest a distillation-based channel pruning method for the alpha matting networks. In the pruning step, we remove channels of a student network having fewer impacts on mimicking the knowledge of a teacher network. Then, the pruned lightweight student network is trained by the same distillation loss. A lightweight alpha matting model from the proposed method outperforms existing lightweight methods. To show superiority of our algorithm, we provide various quantitative and qualitative experiments with in-depth analyses. Furthermore, we demonstrate the versatility of the proposed distillation-based channel pruning method by applying it to semantic segmentation.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77339948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
COLLIDER: A Robust Training Framework for Backdoor Data
H. M. Dolatabadi, S. Erfani, C. Leckie
Deep neural network (DNN) classifiers are vulnerable to backdoor attacks. An adversary poisons some of the training data in such attacks by installing a trigger. The goal is to make the trained DNN output the attacker's desired class whenever the trigger is activated while performing as usual for clean data. Various approaches have recently been proposed to detect malicious backdoored DNNs. However, a robust, end-to-end training approach, like adversarial training, is yet to be discovered for backdoor poisoned data. In this paper, we take the first step toward such methods by developing a robust training framework, COLLIDER, that selects the most prominent samples by exploiting the underlying geometric structures of the data. Specifically, we effectively filter out candidate poisoned data at each training epoch by solving a geometrical coreset selection objective. We first argue how clean data samples exhibit (1) gradients similar to the clean majority of data and (2) low local intrinsic dimensionality (LID). Based on these criteria, we define a novel coreset selection objective to find such samples, which are used for training a DNN. We show the effectiveness of the proposed method for robust training of DNNs on various poisoned datasets, reducing the backdoor success rate significantly.
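One of the two selection criteria, local intrinsic dimensionality, can be estimated as in the small NumPy sketch below using the standard maximum-likelihood estimator over each sample's k nearest-neighbour distances in feature space; samples with unusually high LID would be down-weighted when building the per-epoch coreset. The choice of k, the Euclidean metric and the down-weighting step are illustrative assumptions, not COLLIDER's full coreset objective.

```python
# Toy LID estimate from k-nearest-neighbour distances, illustrative only.
import numpy as np

def lid_mle(features, k=20):
    """features: (N, D) penultimate-layer activations -> (N,) LID estimates."""
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)               # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]           # k nearest neighbours per sample
    r_max = knn[:, -1:]
    # maximum-likelihood estimator: LID = -1 / mean(log(r_i / r_k))
    return -1.0 / np.mean(np.log(knn / r_max + 1e-12), axis=1)

lid = lid_mle(np.random.randn(200, 32))
```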
{"title":"COLLIDER: A Robust Training Framework for Backdoor Data","authors":"H. M. Dolatabadi, S. Erfani, C. Leckie","doi":"10.48550/arXiv.2210.06704","DOIUrl":"https://doi.org/10.48550/arXiv.2210.06704","url":null,"abstract":"Deep neural network (DNN) classifiers are vulnerable to backdoor attacks. An adversary poisons some of the training data in such attacks by installing a trigger. The goal is to make the trained DNN output the attacker's desired class whenever the trigger is activated while performing as usual for clean data. Various approaches have recently been proposed to detect malicious backdoored DNNs. However, a robust, end-to-end training approach, like adversarial training, is yet to be discovered for backdoor poisoned data. In this paper, we take the first step toward such methods by developing a robust training framework, COLLIDER, that selects the most prominent samples by exploiting the underlying geometric structures of the data. Specifically, we effectively filter out candidate poisoned data at each training epoch by solving a geometrical coreset selection objective. We first argue how clean data samples exhibit (1) gradients similar to the clean majority of data and (2) low local intrinsic dimensionality (LID). Based on these criteria, we define a novel coreset selection objective to find such samples, which are used for training a DNN. We show the effectiveness of the proposed method for robust training of DNNs on various poisoned datasets, reducing the backdoor success rate significantly.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86473906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Fine-Grained Image Style Transfer with Visual Transformers
Jianbo Wang, Huan Yang, Jianlong Fu, T. Yamasaki, B. Guo
With the development of the convolutional neural network, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into style transfer results. To solve this problem, we propose a novel STyle TRansformer (STTR) network which breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first propose to use self-attention to encode content and style tokens such that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens that encourages fine-grained style transformations. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT), which are carried out with 50 human subjects with 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results.
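A compact PyTorch sketch of the two attention stages described above: self-attention over the content and style token sequences, then cross-attention that lets each content token query the style tokens. Token extraction (e.g. patchifying CNN features), all dimensions and the decoder are omitted or assumed; this is not the authors' STTR network.

```python
# Toy content/style token mixing with self- and cross-attention, illustrative only.
import torch
import torch.nn as nn

class TokenStyleMixer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.content_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, content_tokens, style_tokens):
        # group and learn similar tokens via self-attention
        c, _ = self.content_self(content_tokens, content_tokens, content_tokens)
        s, _ = self.style_self(style_tokens, style_tokens, style_tokens)
        # content tokens attend to style tokens for fine-grained style transfer
        out, _ = self.cross(query=c, key=s, value=s)
        return out

mixer = TokenStyleMixer()
stylised = mixer(torch.randn(1, 196, 256), torch.randn(1, 196, 256))
```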
{"title":"Fine-Grained Image Style Transfer with Visual Transformers","authors":"Jianbo Wang, Huan Yang, Jianlong Fu, T. Yamasaki, B. Guo","doi":"10.48550/arXiv.2210.05176","DOIUrl":"https://doi.org/10.48550/arXiv.2210.05176","url":null,"abstract":"With the development of the convolutional neural network, image style transfer has drawn increasing attention. However, most existing approaches adopt a global feature transformation to transfer style patterns into content images (e.g., AdaIN and WCT). Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into style transfer results. To solve this problem, we propose a novel STyle TRansformer (STTR) network which breaks both content and style images into visual tokens to achieve a fine-grained style transformation. Specifically, two attention mechanisms are adopted in our STTR. We first propose to use self-attention to encode content and style tokens such that similar tokens can be grouped and learned together. We then adopt cross-attention between content and style tokens that encourages fine-grained style transformations. To compare STTR with existing approaches, we conduct user studies on Amazon Mechanical Turk (AMT), which are carried out with 50 human subjects with 1,000 votes in total. Extensive evaluations demonstrate the effectiveness and efficiency of the proposed STTR in generating visually pleasing style transfer results.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80659278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
Robust Human Matting via Semantic Guidance
Xian-cai Chen, Ye Zhu, Yu Li, Bingtao Fu, Lei Sun, Ying Shan, Shan-shan Liu
Automatic human matting is highly desired for many real applications. We investigate recent human matting methods and show that common bad cases happen when semantic human segmentation fails. This indicates that semantic understanding is crucial for robust human matting. From this, we develop a fast yet accurate human matting framework, named Semantic Guided Human Matting (SGHM). It builds on a semantic human segmentation network and introduces a light-weight matting module with only marginal computational cost. Unlike previous works, our framework is data efficient, which requires a small amount of matting ground-truth to learn to estimate high quality object mattes. Our experiments show that trained with merely 200 matting images, our method can generalize well to real-world datasets, and outperform recent methods on multiple benchmarks, while remaining efficient. Considering the unbearable labeling cost of matting data and widely available segmentation data, our method becomes a practical and effective solution for the task of human matting. Source code is available at https://github.com/cxgincsu/SemanticGuidedHumanMatting.
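A minimal PyTorch sketch of the semantic-guidance idea: a segmentation sub-network predicts a coarse human mask, and a light-weight matting head refines it into an alpha matte conditioned on both the image and the coarse mask. Both sub-networks here are placeholder convolution stacks for illustration, not the SGHM architecture.

```python
# Toy semantic-guided matting pipeline, illustrative only.
import torch
import torch.nn as nn

class SemanticGuidedMatting(nn.Module):
    def __init__(self):
        super().__init__()
        self.segmenter = nn.Sequential(                 # coarse human segmentation
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
        self.matting_head = nn.Sequential(              # light-weight refinement to alpha
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, image):
        coarse = self.segmenter(image)                  # (B, 1, H, W) semantic guidance
        alpha = self.matting_head(torch.cat([image, coarse], dim=1))
        return coarse, alpha

model = SemanticGuidedMatting()
coarse, alpha = model(torch.randn(1, 3, 256, 256))
```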
{"title":"Robust Human Matting via Semantic Guidance","authors":"Xian-cai Chen, Ye Zhu, Yu Li, Bingtao Fu, Lei Sun, Ying Shan, Shan-shan Liu","doi":"10.48550/arXiv.2210.05210","DOIUrl":"https://doi.org/10.48550/arXiv.2210.05210","url":null,"abstract":"Automatic human matting is highly desired for many real applications. We investigate recent human matting methods and show that common bad cases happen when semantic human segmentation fails. This indicates that semantic understanding is crucial for robust human matting. From this, we develop a fast yet accurate human matting framework, named Semantic Guided Human Matting (SGHM). It builds on a semantic human segmentation network and introduces a light-weight matting module with only marginal computational cost. Unlike previous works, our framework is data efficient, which requires a small amount of matting ground-truth to learn to estimate high quality object mattes. Our experiments show that trained with merely 200 matting images, our method can generalize well to real-world datasets, and outperform recent methods on multiple benchmarks, while remaining efficient. Considering the unbearable labeling cost of matting data and widely available segmentation data, our method becomes a practical and effective solution for the task of human matting. Source code is available at https://github.com/cxgincsu/SemanticGuidedHumanMatting.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78588589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Deep Active Ensemble Sampling For Image Classification
S. Mohamadi, Gianfranco Doretto, D. Adjeroh
Conventional active learning (AL) frameworks aim to reduce the cost of data annotation by actively requesting labels for the most informative data points. However, introducing AL to data-hungry deep learning algorithms has been a challenge. Some proposed approaches include uncertainty-based techniques, geometric methods, implicit combinations of uncertainty-based and geometric approaches, and, more recently, frameworks based on semi/self-supervised techniques. In this paper, we address two specific problems in this area. The first is the need for an efficient exploitation/exploration trade-off in sample selection in AL. For this, we present an innovative integration of recent progress in both uncertainty-based and geometric frameworks to enable an efficient exploration/exploitation trade-off in the sample selection strategy. To this end, we build on a computationally efficient approximation of Thompson sampling, with key changes, as a posterior estimator for uncertainty representation. Our framework provides two advantages: (1) accurate posterior estimation, and (2) a tunable trade-off between computational overhead and higher accuracy. The second problem is the need for improved training protocols in deep AL. For this, we use ideas from semi/self-supervised learning to propose a general approach that is independent of the specific AL technique being used. Taken together, our framework shows a significant improvement over the state-of-the-art, with results that are comparable to the performance of supervised learning under the same setting. We show empirical results for our framework, and comparative performance with the state-of-the-art, on four datasets, namely MNIST, CIFAR10, CIFAR100 and ImageNet, to establish a new baseline in two different settings.
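As a schematic illustration of Thompson-sampling-style acquisition with an ensemble acting as an approximate posterior, the PyTorch sketch below samples one ensemble head per acquisition round and requests labels for the unlabelled points with the highest predictive entropy under that head. The ensemble-as-posterior reading and the entropy criterion are assumptions for illustration and differ from the paper's exact estimator.

```python
# Toy Thompson-style acquisition over an ensemble of classifier heads, illustrative only.
import torch
import torch.nn.functional as F

def thompson_acquire(ensemble_logits, budget):
    """ensemble_logits: (E, N, K) logits from E heads over N unlabelled points."""
    E = ensemble_logits.shape[0]
    head = torch.randint(E, (1,)).item()          # draw one posterior sample (one head)
    probs = F.softmax(ensemble_logits[head], dim=-1)
    entropy = -(probs * probs.clamp(min=1e-12).log()).sum(dim=-1)
    return entropy.topk(budget).indices           # indices of points to label next

query_idx = thompson_acquire(torch.randn(5, 1000, 10), budget=32)
```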
{"title":"Deep Active Ensemble Sampling For Image Classification","authors":"S. Mohamadi, Gianfranco Doretto, D. Adjeroh","doi":"10.48550/arXiv.2210.05770","DOIUrl":"https://doi.org/10.48550/arXiv.2210.05770","url":null,"abstract":"Conventional active learning (AL) frameworks aim to reduce the cost of data annotation by actively requesting the labeling for the most informative data points. However, introducing AL to data hungry deep learning algorithms has been a challenge. Some proposed approaches include uncertainty-based techniques, geometric methods, implicit combination of uncertainty-based and geometric approaches, and more recently, frameworks based on semi/self supervised techniques. In this paper, we address two specific problems in this area. The first is the need for efficient exploitation/exploration trade-off in sample selection in AL. For this, we present an innovative integration of recent progress in both uncertainty-based and geometric frameworks to enable an efficient exploration/exploitation trade-off in sample selection strategy. To this end, we build on a computationally efficient approximate of Thompson sampling with key changes as a posterior estimator for uncertainty representation. Our framework provides two advantages: (1) accurate posterior estimation, and (2) tune-able trade-off between computational overhead and higher accuracy. The second problem is the need for improved training protocols in deep AL. For this, we use ideas from semi/self supervised learning to propose a general approach that is independent of the specific AL technique being used. Taken these together, our framework shows a significant improvement over the state-of-the-art, with results that are comparable to the performance of supervised-learning under the same setting. We show empirical results of our framework, and comparative performance with the state-of-the-art on four datasets, namely, MNIST, CIFAR10, CIFAR100 and ImageNet to establish a new baseline in two different settings.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85721765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
DCVQE: A Hierarchical Transformer for Video Quality Assessment
Zu-Hua Li, Lei Yang
The explosion of user-generated videos stimulates a great demand for no-reference video quality assessment (NR-VQA). Inspired by our observation of how humans annotate video quality, we put forward a Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA. Starting from extracting the frame-level quality embeddings (QE), our proposal splits the whole sequence into a number of clips and applies Transformers to learn the clip-level QE and update the frame-level QE simultaneously; another Transformer is introduced to combine the clip-level QE to generate the video-level QE. We call this hierarchical combination of Transformers a Divide and Conquer Transformer (DCTr) layer. Accurate video quality features can be extracted by repeating this DCTr layer several times. Taking the order relationship among the annotated data into account, we also propose a novel correlation loss term for model training. Experiments on various datasets confirm the effectiveness and robustness of our DCVQE model.
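A bare-bones PyTorch sketch of the divide-and-conquer idea: frame-level quality embeddings are split into clips, a clip-level Transformer summarises each clip, and a second Transformer combines the clip summaries into a video-level embedding. Mean pooling stands in for the learned clip/video tokens, and all dimensions are assumed; this is not the DCVQE/DCTr architecture.

```python
# Toy hierarchical (clip-then-video) Transformer for quality estimation, illustrative only.
import torch
import torch.nn as nn

class DivideAndConquerVQE(nn.Module):
    def __init__(self, dim=128, clip_len=16):
        super().__init__()
        self.clip_len = clip_len
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.clip_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.video_transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_qe):
        """frame_qe: (B, T, D) frame-level quality embeddings, T divisible by clip_len."""
        B, T, D = frame_qe.shape
        clips = frame_qe.view(B * (T // self.clip_len), self.clip_len, D)
        clip_qe = self.clip_transformer(clips).mean(dim=1)        # clip-level QE
        clip_qe = clip_qe.view(B, T // self.clip_len, D)
        video_qe = self.video_transformer(clip_qe).mean(dim=1)    # video-level QE
        return self.score(video_qe).squeeze(-1)

model = DivideAndConquerVQE()
scores = model(torch.randn(2, 64, 128))
```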
{"title":"DCVQE: A Hierarchical Transformer for Video Quality Assessment","authors":"Zu-Hua Li, Lei Yang","doi":"10.48550/arXiv.2210.04377","DOIUrl":"https://doi.org/10.48550/arXiv.2210.04377","url":null,"abstract":"The explosion of user-generated videos stimulates a great demand for no-reference video quality assessment (NR-VQA). Inspired by our observation on the actions of human annotation, we put forward a Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA. Starting from extracting the frame-level quality embeddings (QE), our proposal splits the whole sequence into a number of clips and applies Transformers to learn the clip-level QE and update the frame-level QE simultaneously; another Transformer is introduced to combine the clip-level QE to generate the video-level QE. We call this hierarchical combination of Transformers as a Divide and Conquer Transformer (DCTr) layer. An accurate video quality feature extraction can be achieved by repeating the process of this DCTr layer several times. Taking the order relationship among the annotated data into account, we also propose a novel correlation loss term for model training. Experiments on various datasets confirm the effectiveness and robustness of our DCVQE model.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75942421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1