
International Journal of Computer Vision: Latest Publications

Facial Action Unit Detection by Adaptively Constraining Self-Attention and Causally Deconfounding Sample
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-17 | DOI: 10.1007/s11263-024-02258-6
Zhiwen Shao, Hancheng Zhu, Yong Zhou, Xiang Xiang, Bing Liu, Rui Yao, Lizhuang Ma

Facial action unit (AU) detection remains a challenging task due to the subtlety, dynamics, and diversity of AUs. Recently, the prevailing techniques of self-attention and causal inference have been introduced to AU detection. However, most existing methods directly learn self-attention guided by AU detection, or employ common patterns for all AUs during causal intervention. The former often captures irrelevant information in a global range, and the latter ignores the specific causal characteristic of each AU. In this paper, we propose a novel AU detection framework called AC²D by adaptively constraining the self-attention weight distribution and causally deconfounding the sample confounder. Specifically, we explore the mechanism of self-attention weight distribution, in which the self-attention weight distribution of each AU is regarded as a spatial distribution and is adaptively learned under the constraint of location-predefined attention and the guidance of AU detection. Moreover, we propose a causal intervention module for each AU, in which the bias caused by training samples and the interference from irrelevant AUs are both suppressed. Extensive experiments show that our method achieves competitive performance compared to state-of-the-art AU detection approaches on challenging benchmarks, including BP4D, DISFA, GFT, and BP4D+ in constrained scenarios and Aff-Wild2 in unconstrained scenarios.
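As a rough illustration of the constrained-attention idea described above (not the authors' released AC²D code), the sketch below penalizes each AU's attention map for drifting away from a predefined spatial location prior while the usual multi-label detection loss is optimized; tensor shapes, names, and the `lam` weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_constraint_loss(att_weights, location_prior):
    """KL-style penalty pulling each AU's attention map toward a predefined
    spatial prior (e.g., a distribution centered on the AU's facial region).

    att_weights:    (B, num_AUs, H*W) softmax-normalized self-attention weights
    location_prior: (num_AUs, H*W)    normalized predefined location attention
    """
    prior = location_prior.unsqueeze(0).expand_as(att_weights)
    return F.kl_div(att_weights.clamp_min(1e-8).log(), prior, reduction="batchmean")

def total_loss(au_logits, au_labels, att_weights, location_prior, lam=0.1):
    # AU detection is multi-label, so binary cross-entropy per AU,
    # plus the attention constraint weighted by the assumed factor lam.
    det = F.binary_cross_entropy_with_logits(au_logits, au_labels)
    return det + lam * attention_constraint_loss(att_weights, location_prior)
```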

Citations: 0
Towards Data-Centric Face Anti-spoofing: Improving Cross-Domain Generalization via Physics-Based Data Synthesis
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-17 | DOI: 10.1007/s11263-024-02240-2
Rizhao Cai, Cecelia Soh, Zitong Yu, Haoliang Li, Wenhan Yang, Alex C. Kot

Face Anti-Spoofing (FAS) research is challenged by the cross-domain problem, where there is a domain gap between the training and testing data. While recent FAS works are mainly model-centric, focusing on developing domain generalization algorithms to improve cross-domain performance, data-centric research for face anti-spoofing, which improves generalization through data quality and quantity, has been largely overlooked. Therefore, our work starts with data-centric FAS by conducting a comprehensive investigation from the data perspective to improve the cross-domain generalization of FAS models. More specifically, based on the physical procedures of capturing and recapturing, we first propose task-specific FAS data augmentation (FAS-Aug), which increases data diversity by synthesizing artifact data such as printing noise, color distortion, and moiré patterns. Our experiments show that our FAS augmentation can surpass traditional image augmentation in training FAS models to achieve better cross-domain performance. Nevertheless, we observe that models may come to rely on the augmented artifacts, which are not environment-invariant, so using FAS-Aug alone may have a negative effect. As such, we propose Spoofing Attack Risk Equalization (SARE) to prevent models from relying on certain types of artifacts and to improve generalization performance. Last but not least, our proposed FAS-Aug and SARE with recent Vision Transformer backbones can achieve state-of-the-art performance on the FAS cross-domain generalization protocols. The implementation is available at https://github.com/RizhaoCai/FAS-Aug.
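A minimal sketch of the kind of physics-motivated artifact synthesis the abstract describes, assuming a simple sinusoidal moiré model and a per-channel color shift; the function names and parameter values are illustrative, not the released FAS-Aug implementation (see the linked repository for that).

```python
import numpy as np

def add_moire_pattern(img, freq=0.15, angle=30.0, strength=0.08):
    """Overlay a simple sinusoidal interference pattern that mimics the
    moiré artifact seen when a screen is recaptured by a camera.
    img: float32 array in [0, 1], shape (H, W, 3)."""
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w].astype(np.float32)
    theta = np.deg2rad(angle)
    wave = np.sin(2 * np.pi * freq * (xx * np.cos(theta) + yy * np.sin(theta)))
    return np.clip(img + strength * wave[..., None], 0.0, 1.0)

def color_distortion(img, gain=(1.05, 0.95, 1.0), bias=0.02):
    """Per-channel gain/offset roughly emulating printer or display color shift."""
    return np.clip(img * np.array(gain, dtype=np.float32) + bias, 0.0, 1.0)
```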

Citations: 0
Blind Multimodal Quality Assessment of Low-Light Images
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-16 | DOI: 10.1007/s11263-024-02239-9
Miaohui Wang, Zhuowei Xu, Mai Xu, Weisi Lin

Blind image quality assessment (BIQA) aims at automatically and accurately forecasting objective scores for visual signals, which has been widely used to monitor product and service quality in low-light applications, covering smartphone photography, video surveillance, autonomous driving, etc. Recent developments in this field are dominated by unimodal solutions inconsistent with human subjective rating patterns, where human visual perception is simultaneously reflected by multiple sensory information. In this article, we present a unique blind multimodal quality assessment (BMQA) of low-light images from subjective evaluation to objective score. To investigate the multimodal mechanism, we first establish a multimodal low-light image quality (MLIQ) database with authentic low-light distortions, containing image-text modality pairs. Further, we specially design the key modules of BMQA, considering multimodal quality representation, latent feature alignment and fusion, and hybrid self-supervised and supervised learning. Extensive experiments show that our BMQA yields state-of-the-art accuracy on the proposed MLIQ benchmark database. In particular, we also build an independent single-image modality Dark-4K database, which is used to verify its applicability and generalization performance in mainstream unimodal applications. Qualitative and quantitative results on Dark-4K show that BMQA achieves superior performance to existing BIQA approaches as long as a pre-trained model is provided to generate text descriptions. The proposed framework and two databases as well as the collected BIQA methods and evaluation metrics are made publicly available on https://charwill.github.io/bmqa.html.
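The sketch below illustrates, in heavily simplified form, the idea of aligning and fusing image and text features for a scalar quality score; the layer sizes, the cosine alignment loss, and the class name are assumptions, not the paper's actual BMQA modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMultimodalQuality(nn.Module):
    """Toy image+text quality regressor: project both modalities into a shared
    space, encourage alignment, and fuse them to predict one quality score."""
    def __init__(self, img_dim=2048, txt_dim=768, dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, img_feat, txt_feat):
        zi = F.normalize(self.img_proj(img_feat), dim=-1)
        zt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        align_loss = (1 - (zi * zt).sum(dim=-1)).mean()   # cosine alignment term
        score = self.head(torch.cat([zi, zt], dim=-1)).squeeze(-1)
        return score, align_loss
```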

Citations: 0
Audio-Visual Segmentation with Semantics
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-15 | DOI: 10.1007/s11263-024-02261-x
Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources, and 3) fully-supervised audio-visual semantic segmentation. The first two settings need to generate binary masks of sounding objects indicating pixels corresponding to the audio, while the third setting further requires to generate semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench.
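As a toy illustration of a temporal pixel-wise audio-visual interaction (not the AVSBench baseline itself), the block below lets each frame's audio embedding attend over that frame's pixel features and adds the attended context back to every pixel as guidance; all dimensions and names are assumed.

```python
import torch
import torch.nn as nn

class AudioVisualInteraction(nn.Module):
    """Cross-modal block: per frame, the audio embedding queries the pixel
    features, and the attended audio context is broadcast back to all pixels."""
    def __init__(self, vis_dim=256, aud_dim=128, dim=256):
        super().__init__()
        self.q = nn.Linear(aud_dim, dim)
        self.k = nn.Linear(vis_dim, dim)
        self.v = nn.Linear(vis_dim, dim)
        self.out = nn.Linear(dim, vis_dim)

    def forward(self, vis, aud):
        # vis: (B, T, H*W, vis_dim) pixel features; aud: (B, T, aud_dim)
        q = self.q(aud).unsqueeze(2)                                   # (B, T, 1, dim)
        k, v = self.k(vis), self.v(vis)                                # (B, T, HW, dim)
        att = torch.softmax((q * k).sum(-1) / k.shape[-1] ** 0.5, -1)  # (B, T, HW)
        ctx = self.out((att.unsqueeze(-1) * v).sum(2, keepdim=True))   # (B, T, 1, vis_dim)
        return vis + ctx                    # audio-derived context added to every pixel
```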

Citations: 0
Learning Accurate Low-bit Quantization towards Efficient Computational Imaging
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-14 | DOI: 10.1007/s11263-024-02250-0
Sheng Xu, Yanjing Li, Chuanjian Liu, Baochang Zhang

Recent advances in deep neural networks (DNNs) have promoted low-level vision applications in real-world scenarios, e.g., image enhancement and dehazing. Nevertheless, DNN-based methods encounter challenges in terms of high computational and memory requirements, especially when deployed on real-world devices with limited resources. Quantization is an effective compression technique that significantly reduces computational and memory requirements by employing low-bit parameters and bit-wise operations. However, low-bit quantization for computational imaging (Q-Imaging) remains largely unexplored and usually suffers from a significant performance drop compared with the real-valued counterparts. In this work, through empirical analysis, we identify that the main factors responsible for this performance drop are the large gradient estimation error of non-differentiable weight quantization methods and the degeneration of activation information under activation quantization. To address these issues, we introduce a differentiable quantization search (DQS) method to learn the quantized weights and an information boosting module (IBM) for network activation quantization. Our DQS method treats the discrete weights in a quantized neural network as variables that can be searched, and we use a differential approach to search for these weights accurately. Specifically, each weight is represented as a probability distribution across a set of discrete values. During training, these probabilities are optimized, and the values with the highest probabilities are chosen to construct the desired quantized network. Moreover, our IBM module rectifies the activation distribution before quantization to maximize the self-information entropy, which retains the maximum information during the quantization process. Extensive experiments across a range of image processing tasks, including enhancement, super-resolution, denoising, and dehazing, validate the effectiveness of our Q-Imaging, along with superior performance compared to a variety of state-of-the-art quantization methods. In particular, Q-Imaging also achieves strong generalization performance when composing a detection network for the dark object detection task.
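The abstract's description of DQS — each weight held as a learnable probability distribution over discrete levels, softened during training and discretized by argmax afterwards — can be sketched as follows; the candidate levels, temperature, and class name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SearchableQuantWeight(nn.Module):
    """Each weight holds a learnable logit per candidate discrete level.
    Training uses the softmax-weighted (differentiable) mixture of levels;
    after training, the highest-probability level is picked per weight."""
    def __init__(self, shape, levels=(-1.0, -0.5, 0.0, 0.5, 1.0), tau=1.0):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels))
        self.logits = nn.Parameter(torch.zeros(*shape, len(levels)))
        self.tau = tau

    def forward(self):
        probs = torch.softmax(self.logits / self.tau, dim=-1)
        return (probs * self.levels).sum(dim=-1)   # soft, differentiable weight

    @torch.no_grad()
    def discretize(self):
        idx = self.logits.argmax(dim=-1)
        return self.levels[idx]                    # hard quantized weight
```

In this toy setup, something like `SearchableQuantWeight((64, 3, 3, 3))` could stand in for a 3×3 convolution's weight during the search phase, with `discretize()` called once at deployment.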

Citations: 0
Towards Ultra High-Speed Hyperspectral Imaging by Integrating Compressive and Neuromorphic Sampling
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-14 | DOI: 10.1007/s11263-024-02236-y
Mengyue Geng, Lizhi Wang, Lin Zhu, Wei Zhang, Ruiqin Xiong, Yonghong Tian

Hyperspectral and high-speed imaging are both important for scene representation and understanding. However, simultaneously capturing both hyperspectral and high-speed data is still under-explored. In this work, we propose a high-speed hyperspectral imaging system by integrating compressive sensing sampling with bioinspired neuromorphic sampling. Our system includes a coded aperture snapshot spectral imager capturing moderate-speed hyperspectral measurement frames and a spike camera capturing high-speed grayscale dense spike streams. The two cameras provide complementary dual-modality data for reconstructing high-speed hyperspectral videos (HSV). To effectively synergize the two sampling mechanisms and obtain high-quality HSV, we propose a unified multi-modal reconstruction framework. The framework consists of a Spike Spectral Prior Network for spike-based information extraction and prior regularization, coupled with a dual-modality iterative optimization algorithm for reliable reconstruction. We finally build a hardware prototype to verify the effectiveness of our system and algorithm design. Experiments on both simulated and real data demonstrate the superiority of the proposed approach, where for the first time to our knowledge, high-speed HSV with 30 spectral bands can be captured at a frame rate of up to 20,000 FPS.
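For intuition about the compressive branch, here is a simplified single-disperser coded-aperture (CASSI-style) forward model: mask each band, shift it to emulate dispersion, and sum. The spike-camera branch and the reconstruction network are omitted, and the shift-along-width convention is an assumption, not the paper's exact system model.

```python
import numpy as np

def cassi_measurement(hsi_cube, mask, shift=1):
    """Simplified coded-aperture snapshot measurement: mask each spectral band
    with the coded aperture, shift band b by b*shift pixels (dispersion), sum.
    hsi_cube: (H, W, B) scene radiance; mask: (H, W) binary coded aperture."""
    h, w, b = hsi_cube.shape
    meas = np.zeros((h, w + shift * (b - 1)), dtype=np.float32)
    for band in range(b):
        coded = hsi_cube[:, :, band] * mask
        meas[:, band * shift: band * shift + w] += coded
    return meas
```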

Citations: 0
4Seasons: Benchmarking Visual SLAM and Long-Term Localization for Autonomous Driving in Challenging Conditions
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-13 | DOI: 10.1007/s11263-024-02230-4
Patrick Wenzel, Nan Yang, Rui Wang, Niclas Zeller, Daniel Cremers

In this paper, we present a novel visual SLAM and long-term localization benchmark for autonomous driving in challenging conditions based on the large-scale 4Seasons dataset. The proposed benchmark provides drastic appearance variations caused by seasonal changes and diverse weather and illumination conditions. While significant progress has been made in advancing visual SLAM on small-scale datasets with similar conditions, there is still a lack of unified benchmarks representative of real-world scenarios for autonomous driving. We introduce a new unified benchmark for jointly evaluating visual odometry, global place recognition, and map-based visual localization performance which is crucial to successfully enable autonomous driving in any condition. The data has been collected for more than one year, resulting in more than 300 km of recordings in nine different environments ranging from a multi-level parking garage to urban (including tunnels) to countryside and highway. We provide globally consistent reference poses with up to centimeter-level accuracy obtained from the fusion of direct stereo-inertial odometry with RTK GNSS. We evaluate the performance of several state-of-the-art visual odometry and visual localization baseline approaches on the benchmark and analyze their properties. The experimental results provide new insights into current approaches and show promising potential for future research. Our benchmark and evaluation protocols will be available at https://go.vision.in.tum.de/4seasons.
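A typical way to score visual odometry against the centimeter-accurate reference poses mentioned above is the Absolute Trajectory Error after rigid alignment; the sketch below is a generic Kabsch-based ATE-RMSE, not the benchmark's official evaluation script.

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """Absolute Trajectory Error: rigidly align the estimated trajectory to the
    reference (Kabsch alignment, no scale), then report translational RMSE.
    est_xyz, gt_xyz: (N, 3) temporally matched camera positions."""
    mu_e, mu_g = est_xyz.mean(0), gt_xyz.mean(0)
    e, g = est_xyz - mu_e, gt_xyz - mu_g
    u, _, vt = np.linalg.svd(e.T @ g)
    d = np.sign(np.linalg.det(vt.T @ u.T))          # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = (rot @ e.T).T + mu_g
    return float(np.sqrt(((aligned - gt_xyz) ** 2).sum(axis=1).mean()))
```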

Citations: 0
Edge-Oriented Adversarial Attack for Deep Gait Recognition
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-10 | DOI: 10.1007/s11263-024-02225-1
Saihui Hou, Zengbin Wang, Man Zhang, Chunshui Cao, Xu Liu, Yongzhen Huang

Gait recognition is a non-intrusive method that captures unique walking patterns without subject cooperation, and it has emerged as a promising technique across various fields. Recent studies based on Deep Neural Networks (DNNs) have notably improved performance; however, the potential vulnerability inherent in DNNs and their resistance to interference in practical gait recognition systems remain under-explored. To fill this gap, in this paper we focus on imperceptible adversarial attacks for deep gait recognition and propose an edge-oriented attack strategy tailored for silhouette-based approaches. Specifically, we make a pioneering attempt to explore the intrinsic characteristics of binary silhouettes, with a primary focus on injecting noise perturbations into the edge area. This simple yet effective solution enables sparse attacks in both the spatial and temporal dimensions, which largely ensures imperceptibility and simultaneously achieves a high success rate. In particular, our solution is built on a unified framework, allowing seamless switching between untargeted and targeted attack modes. Extensive experiments conducted on in-the-lab and in-the-wild benchmarks validate the effectiveness of our attack strategy and emphasize the necessity of studying adversarial attack and defense strategies in the near future.
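A rough sketch of the edge-restricted, temporally sparse perturbation idea (not the authors' attack): the perturbation is masked to a morphological-gradient edge band and applied only to a random subset of frames; `eps`, `frame_keep`, and the gradient-sign input are assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def edge_mask(silhouette, width=1):
    """Edge region of a binary silhouette: dilation minus erosion
    (a morphological gradient), used to confine the perturbation."""
    sil = silhouette.astype(bool)
    return binary_dilation(sil, iterations=width) ^ binary_erosion(sil, iterations=width)

def edge_constrained_perturbation(silhouette, grad_sign, eps=1.0, frame_keep=0.3, rng=None):
    """Apply a sign-gradient step only on edge pixels, and only on a random
    subset of frames, keeping the attack spatially and temporally sparse.
    silhouette: (T, H, W) binary frames; grad_sign: same shape, values in {-1, 0, 1}."""
    rng = np.random.default_rng() if rng is None else rng
    out = silhouette.astype(np.float32).copy()
    for t in range(silhouette.shape[0]):
        if rng.random() > frame_keep:
            continue                                   # temporal sparsity
        mask = edge_mask(silhouette[t])
        out[t] = np.clip(out[t] + eps * grad_sign[t] * mask, 0.0, 1.0)
    return out
```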

Citations: 0
DLRA-Net: Deep Local Residual Attention Network with Contextual Refinement for Spectral Super-Resolution
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-09 | DOI: 10.1007/s11263-024-02238-w
Ahmed R. El-gabri, Hussein A. Aly, Tarek S. Ghoniemy, Mohamed A. Elshafey

Hyperspectral Images (HSIs) provide detailed scene insights through extensive spectral bands, which are crucial for material discrimination and earth observation but come at substantial acquisition cost and low spatial resolution. Recently, Convolutional Neural Networks (CNNs) have become a common choice for Spectral Super-Resolution (SSR) from Multispectral Images (MSIs). However, they often fail to simultaneously exploit the pixel-level noise degradation of MSIs and the complex contextual spatial-spectral characteristics of HSIs. In this paper, a Deep Local Residual Attention Network with Contextual Refinement (DLRA-Net) is proposed to integrate local low-rank spectral and global contextual priors for improved SSR. Specifically, SSR is unfolded into a Contextual-attention Refinement Module (CRM) and a Dual Local Residual Attention Module (DLRAM). CRM adaptively learns complex contextual priors to guide the convolution layer weights for improved spatial restoration, while DLRAM captures deep refined texture details to enhance the contextual prior representations for recovering HSIs. Moreover, a lateral fusion strategy is designed to integrate the obtained priors among DLRAMs for faster network convergence. Experimental results on natural-scene datasets with practical noise patterns confirm the exceptional performance of DLRA-Net with a relatively small model size. DLRA-Net demonstrates Maximum Relative Improvements (MRI) between 9.71 and 58.58% in Mean Relative Absolute Error (MRAE) with parameter reductions between 52.18 and 85.85%. Besides, a practical RS-HSI dataset is generated for evaluation, showing MRI between 8.64 and 50.56% in MRAE. Furthermore, experiments with HSI classifiers indicate improved performance of the reconstructed RS-HSIs compared to RS-MSIs, with MRI in Overall Accuracy (OA) between 7.10 and 15.27%. Lastly, a detailed ablation study assesses model complexity and runtime.
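The MRAE metric quoted in the results can be computed as below; the `eps` guard against division by zero and the relative-improvement helper are assumptions about how zero reference values and the reported MRI percentages are handled.

```python
import numpy as np

def mrae(pred, gt, eps=1e-6):
    """Mean Relative Absolute Error between a reconstructed and a reference
    hyperspectral cube, averaged over all pixels and bands.
    pred, gt: (H, W, B) non-negative arrays."""
    return float(np.mean(np.abs(pred - gt) / (gt + eps)))

def relative_improvement(mrae_baseline, mrae_ours):
    """Relative improvement (%) of one MRAE over another, in the spirit of
    the Maximum Relative Improvements (MRI) reported above."""
    return 100.0 * (mrae_baseline - mrae_ours) / mrae_baseline
```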

Citations: 0
Mining Generalized Multi-timescale Inconsistency for Detecting Deepfake Videos
IF 19.5 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-10-09 | DOI: 10.1007/s11263-024-02249-7
Yang Yu, Rongrong Ni, Siyuan Yang, Yu Ni, Yao Zhao, Alex C. Kot

Face forgery techniques have continuously evolved, leading to emergent security concerns in society. Existing detection methods have poor generalization ability due, on the one hand, to insufficient extraction of dynamic inconsistency cues and, on the other hand, to their inability to deal well with the gaps between forgery techniques. We therefore develop a new generalized framework that emphasizes extracting generalizable multi-timescale inconsistency cues. Firstly, we capture subtle dynamic inconsistency by magnifying the multipath dynamic inconsistency from a local-consecutive short-term temporal view. Secondly, inter-group graph learning is conducted to establish a sufficiently interactive long-term temporal view for capturing dynamic inconsistency comprehensively. Finally, we design a domain alignment module to directly reduce the distribution gaps by simultaneously disarranging inter- and intra-domain feature distributions, yielding a more generalized framework. Extensive experiments on six large-scale datasets and the designed generalization evaluation protocols show that our framework outperforms state-of-the-art deepfake video detection methods.
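As a loose illustration of the short-term, multi-timescale view (not the paper's actual modules), the sketch below stacks amplified frame differences at several temporal strides so that subtle dynamic inconsistencies are exposed to a downstream detector; the strides and amplification factor are assumptions.

```python
import torch

def multi_timescale_differences(frames, strides=(1, 2, 4), amplify=2.0):
    """Stack frame differences at several temporal strides, amplified so that
    subtle short-term inconsistencies become easier for a detector to pick up.
    frames: (B, T, C, H, W) clip; returns (B, len(strides), T - max(strides), C, H, W)."""
    max_s = max(strides)
    diffs = []
    for s in strides:
        d = frames[:, s:] - frames[:, :-s]                        # stride-s temporal difference
        diffs.append(amplify * d[:, : frames.shape[1] - max_s])   # align lengths across strides
    return torch.stack(diffs, dim=1)
```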

Citations: 0