
Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision: Latest Publications

Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward
Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng
Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and the crucial information conveyed by advertisers. It mainly contains two stages: video segmentation and segment assemblage. The existing method performs well at the video segmentation stage but suffers from dependence on extra cumbersome models and poor performance at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network), which performs efficient and coherent segment assemblage end-to-end. It utilizes multi-modal representations extracted from the segments and follows the Encoder-Decoder Ptr-Net framework with an attention mechanism. An importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset, which contains 1000+ videos under rich ad scenarios collected from advertisers. To evaluate the methods, we propose a unified metric, Imp-Coh@Time, which jointly assesses the importance, coherence, and duration of the outputs. Experimental results show that our method outperforms random selection and the previous method on this metric. Ablation experiments further verify that the multi-modal representations and the importance-coherence reward significantly improve performance. The Ads-1k dataset is available at: https://github.com/yunlong10/Ads-1k
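To make the reward design concrete, here is a minimal sketch of an importance-coherence style scoring function for an assembled sequence of segments, assuming per-segment importance scores, pairwise coherence scores, and a soft duration penalty; the weighting scheme and the names (assemblage_reward, alpha) are illustrative assumptions, not the M-SAN formulation.

```python
# Hypothetical sketch of an importance-coherence style reward for segment assemblage.
# Not the M-SAN implementation: the weighting and duration penalty are assumptions.
from typing import List, Sequence


def assemblage_reward(
    selected: Sequence[int],          # indices of chosen segments, in output order
    importance: Sequence[float],      # per-segment importance scores in [0, 1]
    coherence: List[List[float]],     # coherence[i][j]: how well segment j follows segment i
    durations: Sequence[float],       # per-segment durations in seconds
    target_duration: float,           # desired length of the edited ad
    alpha: float = 0.5,               # trade-off between importance and coherence (assumed)
) -> float:
    if not selected:
        return 0.0

    # Importance: average score of the retained segments.
    imp = sum(importance[i] for i in selected) / len(selected)

    # Coherence: average pairwise score of adjacent segments in the assembled order.
    if len(selected) > 1:
        coh = sum(coherence[a][b] for a, b in zip(selected, selected[1:])) / (len(selected) - 1)
    else:
        coh = 1.0

    # Penalize overshooting the target duration (a simple assumed constraint).
    total = sum(durations[i] for i in selected)
    penalty = max(0.0, (total - target_duration) / target_duration)

    return alpha * imp + (1.0 - alpha) * coh - penalty


# Toy usage: three of five segments are selected for a 15-second cut.
print(assemblage_reward(
    selected=[0, 2, 4],
    importance=[0.9, 0.2, 0.7, 0.1, 0.8],
    coherence=[[0, .5, .8, .1, .3], [.2, 0, .4, .6, .1],
               [.1, .3, 0, .5, .9], [.4, .2, .3, 0, .2], [.6, .1, .2, .4, 0]],
    durations=[5, 4, 6, 3, 5],
    target_duration=15,
))
```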
{"title":"Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward","authors":"Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng","doi":"10.48550/arXiv.2209.12164","DOIUrl":"https://doi.org/10.48550/arXiv.2209.12164","url":null,"abstract":"Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers. It mainly contains two stages: video segmentation and segment assemblage. The existing method performs well at video segmentation stages but suffers from the problems of dependencies on extra cumbersome models and poor performance at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network) which can perform efficient and coherent segment assemblage task end-to-end. It utilizes multi-modal representation extracted from the segments and follows the Encoder-Decoder Ptr-Net framework with the Attention mechanism. Importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset with 1000+ videos under rich ad scenarios collected from advertisers. To evaluate the methods, we propose a unified metric, Imp-Coh@Time, which comprehensively assesses the importance, coherence, and duration of the outputs at the same time. Experimental results show that our method achieves better performance than random selection and the previous method on the metric. Ablation experiments further verify that multi-modal representation and importance-coherence reward significantly improve the performance. Ads-1k dataset is available at: https://github.com/yunlong10/Ads-1k","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73476212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
D$^{\mathbf{3}}$: Duplicate Detection Decontaminator for Multi-Athlete Tracking in Sports Videos
Rui He, Zehua Fu, Qingjie Liu, Yunhong Wang, Xunxun Chen
Tracking multiple athletes in sports videos is a very challenging Multi-Object Tracking (MOT) task, since athletes often have the same appearance and closely occlude each other, turning a common occlusion problem into an abhorrent duplicate-detection problem. In this paper, duplicate detection is newly and precisely defined as occlusion misreporting on the same athlete by multiple detection boxes in one frame. To address this problem, we meticulously design a novel transformer-based Duplicate Detection Decontaminator (D$^3$) for training, and a specific matching algorithm, Rally-Hungarian (RH). Once duplicate detection occurs, D$^3$ immediately modifies the procedure by generating enhanced box losses. RH, triggered by team-sports substitution rules, is exceedingly suitable for sports videos. Moreover, to complement tracking datasets that lack shot changes, we release a new dataset based on sports video named RallyTrack. Extensive experiments on RallyTrack show that combining D$^3$ and RH can dramatically improve tracking performance, with gains of 9.2 in MOTA and 4.5 in HOTA. Meanwhile, experiments on the MOT series and DanceTrack show that D$^3$ can accelerate convergence during training, saving up to 80 percent of the original training time on MOT17. Finally, our model, which is trained only with volleyball videos, can be applied directly to basketball and soccer videos for multi-athlete tracking (MAT), which shows the generality of our method. Our dataset is available at https://github.com/heruihr/rallytrack.
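As background for the matching step, the sketch below shows generic Hungarian assignment between tracks and detections with a 1 - IoU cost, which is the building block that Rally-Hungarian extends; the substitution-rule triggers described in the abstract are not reproduced, and the IoU threshold is an assumption.

```python
# Minimal sketch of Hungarian matching between existing tracks and new detections
# using 1 - IoU as the cost. The Rally-Hungarian rules tied to team-sports
# substitutions are not reproduced here; this is only the generic assignment step.
import numpy as np
from scipy.optimize import linear_sum_assignment


def box_iou(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise IoU between boxes a (N, 4) and b (M, 4) in (x1, y1, x2, y2) format."""
    lt = np.maximum(a[:, None, :2], b[None, :, :2])        # top-left of intersection
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])        # bottom-right of intersection
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)


def match(tracks: np.ndarray, detections: np.ndarray, iou_thresh: float = 0.3):
    """Return (track_idx, det_idx) pairs whose IoU exceeds the threshold."""
    cost = 1.0 - box_iou(tracks, detections)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]


tracks = np.array([[10, 10, 50, 80], [60, 20, 100, 90]], dtype=float)
dets = np.array([[58, 22, 102, 88], [12, 12, 48, 78]], dtype=float)
print(match(tracks, dets))   # expected: track 0 -> det 1, track 1 -> det 0
```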
{"title":"D$^{bf{3}}$: Duplicate Detection Decontaminator for Multi-Athlete Tracking in Sports Videos","authors":"Rui He, Zehua Fu, Qingjie Liu, Yunhong Wang, Xunxun Chen","doi":"10.48550/arXiv.2209.12248","DOIUrl":"https://doi.org/10.48550/arXiv.2209.12248","url":null,"abstract":"Tracking multiple athletes in sports videos is a very challenging Multi-Object Tracking (MOT) task, since athletes often have the same appearance and are intimately covered with each other, making a common occlusion problem becomes an abhorrent duplicate detection. In this paper, the duplicate detection is newly and precisely defined as occlusion misreporting on the same athlete by multiple detection boxes in one frame. To address this problem, we meticulously design a novel transformer-based Duplicate Detection Decontaminator (D$^3$) for training, and a specific algorithm Rally-Hungarian (RH) for matching. Once duplicate detection occurs, D$^3$ immediately modifies the procedure by generating enhanced boxes losses. RH, triggered by the team sports substitution rules, is exceedingly suitable for sports videos. Moreover, to complement the tracking dataset that without shot changes, we release a new dataset based on sports video named RallyTrack. Extensive experiments on RallyTrack show that combining D$^3$ and RH can dramatically improve the tracking performance with 9.2 in MOTA and 4.5 in HOTA. Meanwhile, experiments on MOT-series and DanceTrack discover that D$^3$ can accelerate convergence during training, especially save up to 80 percent of the original training time on MOT17. Finally, our model, which is trained only with volleyball videos, can be applied directly to basketball and soccer videos for MAT, which shows priority of our method. Our dataset is available at https://github.com/heruihr/rallytrack.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88244770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Simple Strategy to Provable Invariance via Orbit Mapping
Kanchana Vaishnavi Gandikota, Jonas Geiping, Zorah Lahner, Adam Czapli'nski, Michael Moeller
Many applications require robustness, or ideally invariance, of neural networks to certain transformations of input data. Most commonly, this requirement is addressed by training data augmentation, using adversarial training, or defining network architectures that include the desired invariance by design. In this work, we propose a method to make network architectures provably invariant with respect to group actions by choosing one element from a (possibly continuous) orbit based on a fixed criterion. In a nutshell, we intend to 'undo' any possible transformation before feeding the data into the actual network. Further, we empirically analyze the properties of different approaches which incorporate invariance via training or architecture, and demonstrate the advantages of our method in terms of robustness and computational efficiency. In particular, we investigate the robustness with respect to rotations of images (which can hold up to discretization artifacts) as well as the provable orientation and scaling invariance of 3D point cloud classification.
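As a rough illustration of the orbit-mapping idea, the sketch below canonicalizes a 3D point cloud by aligning it with its principal axes before any network sees it, so a rotated copy maps to the same input; the PCA criterion and the sign convention are assumptions chosen for illustration and are not necessarily the fixed criteria used in the paper.

```python
# Illustrative sketch of the "undo the transformation first" idea: map every input
# to a canonical representative of its rotation orbit before the network sees it.
# The canonical pose here is chosen by PCA alignment with a third-moment sign rule;
# the fixed criteria used in the paper may differ.
import numpy as np


def canonicalize(points: np.ndarray) -> np.ndarray:
    """Rotate a centered point cloud (N, 3) so its principal axes align with x/y/z."""
    centered = points - points.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(centered)
    _, eigvecs = np.linalg.eigh(cov)                        # sorted by ascending eigenvalue
    basis = eigvecs[:, ::-1]                                # largest-variance axis first
    # Resolve each axis' sign ambiguity with the third moment of the projections.
    signs = np.sign(np.sum((centered @ basis) ** 3, axis=0))
    signs[signs == 0] = 1.0
    return centered @ (basis * signs)


rng = np.random.default_rng(0)
cloud = rng.gamma(2.0, size=(256, 3)) * np.array([5.0, 2.0, 0.5])   # skewed, anisotropic cloud

# A rotated copy maps to (numerically) the same canonical pose, so any downstream
# network that consumes the canonicalized input is rotation-invariant by construction.
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
print(np.allclose(canonicalize(cloud), canonicalize(cloud @ rot_z.T), atol=1e-6))
```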
{"title":"A Simple Strategy to Provable Invariance via Orbit Mapping","authors":"Kanchana Vaishnavi Gandikota, Jonas Geiping, Zorah Lahner, Adam Czapli'nski, Michael Moeller","doi":"10.48550/arXiv.2209.11916","DOIUrl":"https://doi.org/10.48550/arXiv.2209.11916","url":null,"abstract":"Many applications require robustness, or ideally invariance, of neural networks to certain transformations of input data. Most commonly, this requirement is addressed by training data augmentation, using adversarial training, or defining network architectures that include the desired invariance by design. In this work, we propose a method to make network architectures provably invariant with respect to group actions by choosing one element from a (possibly continuous) orbit based on a fixed criterion. In a nutshell, we intend to 'undo' any possible transformation before feeding the data into the actual network. Further, we empirically analyze the properties of different approaches which incorporate invariance via training or architecture, and demonstrate the advantages of our method in terms of robustness and computational efficiency. In particular, we investigate the robustness with respect to rotations of images (which can hold up to discretization artifacts) as well as the provable orientation and scaling invariance of 3D point cloud classification.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86495817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Modular Degradation Simulation and Restoration for Under-Display Camera
Yang Zhou, Yuda Song, Xin Du
Under-display camera (UDC) provides an elegant solution for full-screen smartphones. However, UDC-captured images suffer from severe degradation since the sensors lie under the display. Although this issue can be tackled by image restoration networks, these networks require large-scale image pairs for training. To this end, we propose a modular network dubbed MPGNet, trained using the generative adversarial network (GAN) framework, for simulating UDC imaging. Specifically, we note that the UDC imaging degradation process contains brightness attenuation, blurring, and noise corruption. Thus we model each degradation with a characteristic-related modular network, and all modular networks are cascaded to form the generator. Together with a pixel-wise discriminator and a supervised loss, we can train the generator to simulate the UDC imaging degradation process. Furthermore, we present a Transformer-style network named DWFormer for UDC image restoration. For practical purposes, we use depth-wise convolution instead of multi-head self-attention to aggregate local spatial information. Moreover, we propose a novel channel attention module to aggregate global information, which is critical for brightness recovery. We conduct evaluations on the UDC benchmark, and our method surpasses the previous state-of-the-art models by 1.23 dB on the P-OLED track and by 0.71 dB on the T-OLED track.
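To illustrate the cascaded, characteristic-related structure of the degradation model, here is a toy pipeline that applies brightness attenuation, blur, and noise as separate modules in sequence; in MPGNet each stage is a learned sub-network trained adversarially, whereas the fixed parameters below are placeholder assumptions.

```python
# Toy stand-in for the cascaded degradation idea: brightness attenuation, blur and
# noise applied as separate modules in sequence. In MPGNet each stage is a learned,
# characteristic-related sub-network trained adversarially; the fixed parameters
# below are placeholders chosen only for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BrightnessAttenuation(nn.Module):
    def __init__(self, factor: float = 0.6):
        super().__init__()
        self.factor = factor

    def forward(self, x):                       # x: (B, C, H, W) in [0, 1]
        return x * self.factor


class GaussianBlur(nn.Module):
    def __init__(self, kernel_size: int = 5, sigma: float = 1.5):
        super().__init__()
        coords = torch.arange(kernel_size) - kernel_size // 2
        g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
        kernel = torch.outer(g, g)
        self.register_buffer("kernel", (kernel / kernel.sum())[None, None])

    def forward(self, x):
        k = self.kernel.expand(x.shape[1], 1, -1, -1)       # depth-wise blur kernel
        return F.conv2d(x, k, padding=k.shape[-1] // 2, groups=x.shape[1])


class ShotNoise(nn.Module):
    def __init__(self, std: float = 0.02):
        super().__init__()
        self.std = std

    def forward(self, x):
        return (x + torch.randn_like(x) * self.std).clamp(0, 1)


degrade = nn.Sequential(BrightnessAttenuation(), GaussianBlur(), ShotNoise())
clean = torch.rand(1, 3, 64, 64)
udc_like = degrade(clean)                       # simulated under-display capture
print(udc_like.shape)
```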
{"title":"Modular Degradation Simulation and Restoration for Under-Display Camera","authors":"Yang Zhou, Yuda Song, Xin Du","doi":"10.48550/arXiv.2209.11455","DOIUrl":"https://doi.org/10.48550/arXiv.2209.11455","url":null,"abstract":"Under-display camera (UDC) provides an elegant solution for full-screen smartphones. However, UDC captured images suffer from severe degradation since sensors lie under the display. Although this issue can be tackled by image restoration networks, these networks require large-scale image pairs for training. To this end, we propose a modular network dubbed MPGNet trained using the generative adversarial network (GAN) framework for simulating UDC imaging. Specifically, we note that the UDC imaging degradation process contains brightness attenuation, blurring, and noise corruption. Thus we model each degradation with a characteristic-related modular network, and all modular networks are cascaded to form the generator. Together with a pixel-wise discriminator and supervised loss, we can train the generator to simulate the UDC imaging degradation process. Furthermore, we present a Transformer-style network named DWFormer for UDC image restoration. For practical purposes, we use depth-wise convolution instead of the multi-head self-attention to aggregate local spatial information. Moreover, we propose a novel channel attention module to aggregate global information, which is critical for brightness recovery. We conduct evaluations on the UDC benchmark, and our method surpasses the previous state-of-the-art models by 1.23 dB on the P-OLED track and 0.71 dB on the T-OLED track, respectively.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91521387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
MGTR: End-to-End Mutual Gaze Detection with Transformer
Han Guo, Zhengxi Hu, Jingtai Liu
People looking at each other, or mutual gaze, is ubiquitous in our daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods rely on two-stage pipelines, whose inference speed is limited by the two-stage design and whose second-stage performance is affected by the first stage. In this paper, we propose a novel one-stage mutual gaze detection framework called Mutual Gaze TRansformer, or MGTR, which performs mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer mutual gaze relationships based on global image information, which streamlines and simplifies the whole process. Experimental results on two mutual gaze datasets show that our method accelerates the mutual gaze detection process without losing performance. An ablation study shows that different components of MGTR capture different levels of semantic information in images. Code is available at https://github.com/Gmbition/MGTR
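A minimal sketch of how mutual gaze instance triples could be decoded from transformer query embeddings is shown below: each query predicts two head boxes and a mutual-gaze score. The encoder-decoder that produces the queries and the matching loss are omitted, and the layer sizes and names are assumptions rather than MGTR's actual design.

```python
# Minimal sketch of decoding "mutual gaze instance triples" from query embeddings:
# each query predicts two head boxes and a score for whether the pair is in mutual
# gaze. The transformer that produces the query embeddings (and the matching loss)
# is omitted, and the layer sizes here are assumptions, not MGTR's.
import torch
import torch.nn as nn


class MutualGazeTripleHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.box_a = nn.Linear(dim, 4)      # first head box  (cx, cy, w, h), normalized
        self.box_b = nn.Linear(dim, 4)      # second head box (cx, cy, w, h), normalized
        self.gaze = nn.Linear(dim, 1)       # logit: are the two people looking at each other?

    def forward(self, queries: torch.Tensor):
        """queries: (B, N, dim) decoder outputs -> dict of triple components."""
        return {
            "boxes_a": self.box_a(queries).sigmoid(),
            "boxes_b": self.box_b(queries).sigmoid(),
            "mutual_gaze": self.gaze(queries).squeeze(-1).sigmoid(),
        }


head = MutualGazeTripleHead()
queries = torch.randn(2, 100, 256)          # e.g. 100 queries per image
out = head(queries)
keep = out["mutual_gaze"] > 0.5             # simple confidence threshold at inference
print(out["boxes_a"].shape, keep.shape)     # (2, 100, 4) (2, 100)
```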
{"title":"MGTR: End-to-End Mutual Gaze Detection with Transformer","authors":"Han Guo, Zhengxi Hu, Jingtai Liu","doi":"10.48550/arXiv.2209.10930","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10930","url":null,"abstract":"People's looking at each other or mutual gaze is ubiquitous in our daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods focus on two-stage methods, whose inference speed is limited by the two-stage pipeline and the performance in the second stage is affected by the first one. In this paper, we propose a novel one-stage mutual gaze detection framework called Mutual Gaze TRansformer or MGTR to perform mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer mutual gaze relationship based on global image information, which streamlines the whole process with simplicity. Experimental results on two mutual gaze datasets show that our method is able to accelerate mutual gaze detection process without losing performance. Ablation study shows that different components of MGTR can capture different levels of semantic information in images. Code is available at https://github.com/Gmbition/MGTR","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84899233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
DIG: Draping Implicit Garment over the Human Body
Ren Li, Benoît Guillard, Edoardo Remelli, P. Fua
Existing data-driven methods for draping garments over human bodies, despite being effective, cannot handle garments of arbitrary topology and are typically not end-to-end differentiable. To address these limitations, we propose an end-to-end differentiable pipeline that represents garments using implicit surfaces and learns a skinning field conditioned on the shape and pose parameters of an articulated body model. To limit body-garment interpenetrations and artifacts, we propose an interpenetration-aware pre-processing strategy for the training data and a novel training loss that penalizes self-intersections while draping garments. We demonstrate that our method yields more accurate results for garment reconstruction and deformation than state-of-the-art methods. Furthermore, we show that our method, thanks to its end-to-end differentiability, allows body and garment parameters to be recovered jointly from image observations, something that previous work could not do.
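The sketch below illustrates one simple form an interpenetration penalty can take: garment points whose signed distance to the body is negative (inside the body) are penalized. The body_sdf callable, the margin, and the toy sphere body are assumptions; the paper's pre-processing strategy and loss are more involved.

```python
# Sketch of an interpenetration penalty: garment points that fall inside the body
# (negative signed distance) are pushed out. `body_sdf` is a placeholder for any
# differentiable signed-distance query of the posed body; the paper's actual
# pre-processing and loss formulation may differ in detail.
import torch


def interpenetration_loss(garment_points: torch.Tensor,
                          body_sdf,                      # callable: (N, 3) -> (N,) signed distances
                          margin: float = 0.0) -> torch.Tensor:
    """Penalize garment points whose signed distance falls below `margin` (inside the body)."""
    sdf = body_sdf(garment_points)                       # > 0 outside, < 0 inside
    return torch.relu(margin - sdf).mean()


# Toy body: a unit sphere centered at the origin, so the SDF is analytic.
def sphere_sdf(points: torch.Tensor) -> torch.Tensor:
    return points.norm(dim=-1) - 1.0


garment = (torch.randn(1024, 3) * 1.1).requires_grad_()
loss = interpenetration_loss(garment, sphere_sdf)
loss.backward()                                          # gradients push inside points outward
print(float(loss), bool(garment.grad.abs().sum() > 0))
```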
{"title":"DIG: Draping Implicit Garment over the Human Body","authors":"Ren Li, Benoît Guillard, Edoardo Remelli, P. Fua","doi":"10.48550/arXiv.2209.10845","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10845","url":null,"abstract":"Existing data-driven methods for draping garments over human bodies, despite being effective, cannot handle garments of arbitrary topology and are typically not end-to-end differentiable. To address these limitations, we propose an end-to-end differentiable pipeline that represents garments using implicit surfaces and learns a skinning field conditioned on shape and pose parameters of an articulated body model. To limit body-garment interpenetrations and artifacts, we propose an interpenetration-aware pre-processing strategy of training data and a novel training loss that penalizes self-intersections while draping garments. We demonstrate that our method yields more accurate results for garment reconstruction and deformation with respect to state of the art methods. Furthermore, we show that our method, thanks to its end-to-end differentiability, allows to recover body and garments parameters jointly from image observations, something that previous work could not do.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88753799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
LatentGaze: Cross-Domain Gaze Estimation through Gaze-Aware Analytic Latent Code Manipulation
Isack Lee, June Yun, Hee Hyeon Kim, Youngju Na, S. Yoo
Although recent gaze estimation methods lay great emphasis on attentively extracting gaze-relevant features from facial or eye images, how to define features that include gaze-relevant components has remained ambiguous. This obscurity makes the model learn not only gaze-relevant features but also irrelevant ones. In particular, it is fatal for cross-dataset performance. To overcome this challenging issue, we propose a gaze-aware analytic manipulation method, based on a data-driven approach that exploits the disentanglement characteristics of generative adversarial network inversion, to selectively utilize gaze-relevant features in a latent code. Furthermore, by utilizing a GAN-based encoder-generator process, we shift the input image from the target domain to a source-domain image of which the gaze estimator is sufficiently aware. In addition, we propose a gaze distortion loss in the encoder that prevents the distortion of gaze information. The experimental results demonstrate that our method achieves state-of-the-art accuracy in cross-domain gaze estimation tasks. The code is available at https://github.com/leeisack/LatentGaze/.
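Conceptually, the latent manipulation can be pictured as keeping the gaze-relevant dimensions of the inverted target-domain latent and borrowing the remaining dimensions from a source-domain latent, as in the sketch below; the relevance mask, latent dimensionality, and the generator/estimator names in the comments are hypothetical placeholders, not LatentGaze's actual analytic selection procedure.

```python
# Conceptual sketch of gaze-aware latent code manipulation: keep the latent channels
# judged gaze-relevant from the inverted target-domain image and borrow the remaining
# channels from a source-domain latent, then decode with the generator. The encoder,
# generator and the way the relevance mask is obtained are placeholders; LatentGaze's
# actual selection procedure is analytic and learned from data.
import torch


def mix_latents(target_latent: torch.Tensor,
                source_latent: torch.Tensor,
                gaze_mask: torch.Tensor) -> torch.Tensor:
    """Blend two latent codes channel-wise; gaze_mask in [0, 1] marks gaze-relevant dims."""
    return gaze_mask * target_latent + (1.0 - gaze_mask) * source_latent


latent_dim = 512
w_target = torch.randn(1, latent_dim)          # inversion of the target-domain face (assumed)
w_source = torch.randn(1, latent_dim)          # latent of a source-domain reference (assumed)
mask = (torch.rand(latent_dim) > 0.8).float()  # stand-in for a learned relevance mask

w_mixed = mix_latents(w_target, w_source, mask)
# In the full pipeline, w_mixed would be decoded by the GAN generator and fed to the
# gaze estimator, e.g. gaze = estimator(generator(w_mixed)) (names hypothetical).
print(w_mixed.shape)
```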
{"title":"LatentGaze: Cross-Domain Gaze Estimation through Gaze-Aware Analytic Latent Code Manipulation","authors":"Isack Lee, June Yun, Hee Hyeon Kim, Youngju Na, S. Yoo","doi":"10.48550/arXiv.2209.10171","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10171","url":null,"abstract":"Although recent gaze estimation methods lay great emphasis on attentively extracting gaze-relevant features from facial or eye images, how to define features that include gaze-relevant components has been ambiguous. This obscurity makes the model learn not only gaze-relevant features but also irrelevant ones. In particular, it is fatal for the cross-dataset performance. To overcome this challenging issue, we propose a gaze-aware analytic manipulation method, based on a data-driven approach with generative adversarial network inversion's disentanglement characteristics, to selectively utilize gaze-relevant features in a latent code. Furthermore, by utilizing GAN-based encoder-generator process, we shift the input image from the target domain to the source domain image, which a gaze estimator is sufficiently aware. In addition, we propose gaze distortion loss in the encoder that prevents the distortion of gaze information. The experimental results demonstrate that our method achieves state-of-the-art gaze estimation accuracy in a cross-domain gaze estimation tasks. This code is available at https://github.com/leeisack/LatentGaze/.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79263380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
IoU-Enhanced Attention for End-to-End Task Specific Object Detection
Jing Zhao, Shengjian Wu, Li Sun, Qingli Li
Without densely tiled anchor boxes or grid points in the image, Sparse R-CNN achieves promising results through a set of object queries and proposal boxes updated in a cascaded training manner. However, due to the sparse nature and the one-to-one relation between a query and its attending region, it depends heavily on self-attention, which is usually inaccurate in the early training stage. Moreover, in a scene of dense objects, an object query interacts with many irrelevant ones, reducing its uniqueness and harming the performance. This paper proposes to use the IoU between different boxes as a prior for value routing in self-attention. The original attention matrix is multiplied by a matrix of the same size computed from the IoU of the proposal boxes, and together they determine the routing scheme so that irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads to provide dynamic channel masks based on the object query, and their outputs are multiplied with the output of the dynamic convolutions, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets, including MS-COCO and CrowdHuman, showing that it significantly improves the performance and increases the model convergence speed.
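A minimal sketch of the IoU prior on value routing is given below, assuming the modulation is an element-wise product between the softmax attention matrix and the pairwise IoU matrix of the proposal boxes, followed by row renormalization; the multi-head structure and the projection heads for classification and regression are omitted, and these details are assumptions.

```python
# Sketch of IoU-enhanced self-attention over object queries: the softmax attention
# matrix is modulated (element-wise, as the description suggests) by the pairwise IoU
# of the corresponding proposal boxes, so queries whose boxes do not overlap exchange
# little information. Head counts and normalization details are assumptions.
import torch
import torch.nn.functional as F


def pairwise_iou(boxes: torch.Tensor) -> torch.Tensor:
    """boxes: (N, 4) as (x1, y1, x2, y2) -> (N, N) IoU matrix."""
    lt = torch.maximum(boxes[:, None, :2], boxes[None, :, :2])
    rb = torch.minimum(boxes[:, None, 2:], boxes[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter + 1e-9)


def iou_enhanced_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           boxes: torch.Tensor) -> torch.Tensor:
    """q, k, v: (N, d) query/key/value features for N proposals; boxes: (N, 4)."""
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(0, 1) / d ** 0.5, dim=-1)    # (N, N)
    attn = attn * pairwise_iou(boxes)                             # IoU prior gates the routing
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-9)  # renormalize rows (assumed)
    return attn @ v


n, d = 100, 256
feats = torch.randn(n, d)
boxes = torch.rand(n, 4) * 100
boxes[:, 2:] = boxes[:, :2] + torch.rand(n, 2) * 50 + 1           # ensure x2 > x1, y2 > y1
print(iou_enhanced_attention(feats, feats, feats, boxes).shape)   # (100, 256)
```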
{"title":"IoU-Enhanced Attention for End-to-End Task Specific Object Detection","authors":"Jing Zhao, Shengjian Wu, Li Sun, Qingli Li","doi":"10.48550/arXiv.2209.10391","DOIUrl":"https://doi.org/10.48550/arXiv.2209.10391","url":null,"abstract":"Without densely tiled anchor boxes or grid points in the image, sparse R-CNN achieves promising results through a set of object queries and proposal boxes updated in the cascaded training manner. However, due to the sparse nature and the one-to-one relation between the query and its attending region, it heavily depends on the self attention, which is usually inaccurate in the early training stage. Moreover, in a scene of dense objects, the object query interacts with many irrelevant ones, reducing its uniqueness and harming the performance. This paper proposes to use IoU between different boxes as a prior for the value routing in self attention. The original attention matrix multiplies the same size matrix computed from the IoU of proposal boxes, and they determine the routing scheme so that the irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads to provide the dynamic channel masks based on object query, and they multiply with the output from dynamic convs, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets, including MS-COCO and CrowdHuman, showing that it significantly improves the performance and increases the model convergence speed.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87538235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Revisiting Image Pyramid Structure for High Resolution Salient Object Detection
Taehung Kim, Kunhee Kim, J. Lee, D. Cha, Ji-Heon Lee, Daijin Kim
Salient object detection (SOD) has been in the spotlight recently, yet it has been studied less for high-resolution (HR) images. Unfortunately, HR images and their pixel-level annotations are certainly more labor-intensive and time-consuming to obtain than low-resolution (LR) images and annotations. Therefore, we propose an image pyramid-based SOD framework, the Inverse Saliency Pyramid Reconstruction Network (InSPyReNet), for HR prediction without any HR datasets. We design InSPyReNet to produce a strict image pyramid structure of the saliency map, which makes it possible to ensemble multiple results with pyramid-based image blending. For HR prediction, we design a pyramid blending method that synthesizes two different image pyramids from a pair of LR and HR scales of the same image to overcome the effective receptive field (ERF) discrepancy. Our extensive evaluations on public LR and HR SOD benchmarks demonstrate that InSPyReNet surpasses the state-of-the-art (SotA) methods on various SOD metrics and boundary accuracy.
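For intuition, here is a generic Laplacian-pyramid blending sketch that combines the coarse structure of an LR-scale prediction with the fine detail of an HR-scale prediction; it illustrates pyramid-based blending in general, and the level count, pooling choice, and blending rule are assumptions rather than InSPyReNet's exact scheme.

```python
# Generic Laplacian-pyramid blending of two saliency maps predicted at LR and HR
# scales: low-frequency structure is taken from the LR prediction and high-frequency
# detail from the HR prediction. This illustrates pyramid-based blending in general;
# InSPyReNet's saliency-pyramid construction and blending rules are more specific.
import torch
import torch.nn.functional as F


def build_pyramids(x: torch.Tensor, levels: int):
    """Return (gaussian, laplacian) pyramids of a (B, 1, H, W) map."""
    gauss, lap = [x], []
    for _ in range(levels):
        down = F.avg_pool2d(gauss[-1], kernel_size=2)
        up = F.interpolate(down, size=gauss[-1].shape[-2:], mode="bilinear", align_corners=False)
        lap.append(gauss[-1] - up)
        gauss.append(down)
    return gauss, lap


def blend(lr_pred: torch.Tensor, hr_pred: torch.Tensor, levels: int = 3) -> torch.Tensor:
    """Combine coarse structure of lr_pred (upsampled) with fine detail of hr_pred."""
    lr_up = F.interpolate(lr_pred, size=hr_pred.shape[-2:], mode="bilinear", align_corners=False)
    gauss_lr, _ = build_pyramids(lr_up, levels)
    _, lap_hr = build_pyramids(hr_pred, levels)
    out = gauss_lr[-1]                                   # coarsest level from the LR branch
    for lap in reversed(lap_hr):                         # add back HR detail level by level
        out = F.interpolate(out, size=lap.shape[-2:], mode="bilinear", align_corners=False) + lap
    return out.clamp(0, 1)


lr_saliency = torch.rand(1, 1, 256, 256)                 # prediction on the LR input
hr_saliency = torch.rand(1, 1, 1024, 1024)               # prediction on the HR input
print(blend(lr_saliency, hr_saliency).shape)             # (1, 1, 1024, 1024)
```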
{"title":"Revisiting Image Pyramid Structure for High Resolution Salient Object Detection","authors":"Taehung Kim, Kunhee Kim, J. Lee, D. Cha, Ji-Heon Lee, Daijin Kim","doi":"10.48550/arXiv.2209.09475","DOIUrl":"https://doi.org/10.48550/arXiv.2209.09475","url":null,"abstract":"Salient object detection (SOD) has been in the spotlight recently, yet has been studied less for high-resolution (HR) images. Unfortunately, HR images and their pixel-level annotations are certainly more labor-intensive and time-consuming compared to low-resolution (LR) images and annotations. Therefore, we propose an image pyramid-based SOD framework, Inverse Saliency Pyramid Reconstruction Network (InSPyReNet), for HR prediction without any of HR datasets. We design InSPyReNet to produce a strict image pyramid structure of saliency map, which enables to ensemble multiple results with pyramid-based image blending. For HR prediction, we design a pyramid blending method which synthesizes two different image pyramids from a pair of LR and HR scale from the same image to overcome effective receptive field (ERF) discrepancy. Our extensive evaluations on public LR and HR SOD benchmarks demonstrate that InSPyReNet surpasses the State-of-the-Art (SotA) methods on various SOD metrics and boundary accuracy.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77561772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Causes of Catastrophic Forgetting in Class-Incremental Semantic Segmentation
Tobias Kalb, J. Beyerer
Class-incremental learning for semantic segmentation (CiSS) is presently a highly researched field that aims at updating a semantic segmentation model by sequentially learning new semantic classes. A major challenge in CiSS is overcoming the effects of catastrophic forgetting, which describes the sudden drop of accuracy on previously learned classes after the model is trained on a new set of classes. Despite the latest advances in mitigating catastrophic forgetting, the underlying causes of forgetting specifically in CiSS are not well understood. Therefore, in a set of experiments and representational analyses, we demonstrate that the semantic shift of the background class and a bias towards new classes are the major causes of forgetting in CiSS. Furthermore, we show that both causes mostly manifest themselves in the deeper classification layers of the network, while the early layers of the model are not affected. Finally, we demonstrate how both causes are effectively mitigated by utilizing the information contained in the background, with the help of knowledge distillation and an unbiased cross-entropy loss.
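The two mitigations named above can be sketched as follows: a plain knowledge-distillation term on the old-class logits, and an unbiased cross-entropy in which the background likelihood also absorbs the old classes, since old-class pixels are labeled as background at the current step (a MiB-style construction). The exact formulations in the paper may differ; the temperature, class counts, and tensor shapes below are assumptions.

```python
# Sketch of two mitigations for class-incremental segmentation: (1) an "unbiased"
# cross-entropy where the background likelihood absorbs the old classes (which the
# current ground truth labels as background), and (2) distillation of the old model's
# predictions to limit drift. Details are assumptions, not the paper's exact losses.
import torch
import torch.nn.functional as F


def unbiased_cross_entropy(new_logits: torch.Tensor,   # (B, C_old + C_new, H, W)
                           labels: torch.Tensor,       # (B, H, W); background = 0
                           num_old: int) -> torch.Tensor:
    """Background likelihood absorbs the old classes, which current labels mark as background."""
    log_probs = F.log_softmax(new_logits, dim=1)
    bg = torch.logsumexp(log_probs[:, :num_old], dim=1, keepdim=True)   # bg + old classes
    log_probs = torch.cat([bg, log_probs[:, 1:]], dim=1)
    return F.nll_loss(log_probs, labels)


def distillation_loss(new_logits: torch.Tensor, old_logits: torch.Tensor,
                      num_old: int, temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between the old model's soft predictions and the new model's old-class slice."""
    log_p_old = F.log_softmax(old_logits / temperature, dim=1)
    log_p_new = F.log_softmax(new_logits[:, :num_old] / temperature, dim=1)
    return F.kl_div(log_p_new, log_p_old, reduction="batchmean", log_target=True) * temperature ** 2


num_old, num_new = 11, 5                                   # e.g. background + 10 old, 5 new classes
new_logits = torch.randn(2, num_old + num_new, 32, 32)
old_logits = torch.randn(2, num_old, 32, 32)
labels = torch.randint(num_old, num_old + num_new, (2, 32, 32))
labels[torch.rand(2, 32, 32) < 0.5] = 0                    # old-class pixels appear as background
print(unbiased_cross_entropy(new_logits, labels, num_old).item(),
      distillation_loss(new_logits, old_logits, num_old).item())
```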
{"title":"Causes of Catastrophic Forgetting in Class-Incremental Semantic Segmentation","authors":"Tobias Kalb, J. Beyerer","doi":"10.48550/arXiv.2209.08010","DOIUrl":"https://doi.org/10.48550/arXiv.2209.08010","url":null,"abstract":"Class-incremental learning for semantic segmentation (CiSS) is presently a highly researched field which aims at updating a semantic segmentation model by sequentially learning new semantic classes. A major challenge in CiSS is overcoming the effects of catastrophic forgetting, which describes the sudden drop of accuracy on previously learned classes after the model is trained on a new set of classes. Despite latest advances in mitigating catastrophic forgetting, the underlying causes of forgetting specifically in CiSS are not well understood. Therefore, in a set of experiments and representational analyses, we demonstrate that the semantic shift of the background class and a bias towards new classes are the major causes of forgetting in CiSS. Furthermore, we show that both causes mostly manifest themselves in deeper classification layers of the network, while the early layers of the model are not affected. Finally, we demonstrate how both causes are effectively mitigated utilizing the information contained in the background, with the help of knowledge distillation and an unbiased cross-entropy loss.","PeriodicalId":87238,"journal":{"name":"Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90700748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3