
International Journal of Computer Vision: Latest Publications

Audio-Visual Segmentation with Semantics
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-15 · DOI: 10.1007/s11263-024-02261-x
Jinxing Zhou, Xuyang Shen, Jianyuan Wang, Jiayi Zhang, Weixuan Sun, Jing Zhang, Stan Birchfield, Dan Guo, Lingpeng Kong, Meng Wang, Yiran Zhong

We propose a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark, i.e., AVSBench, providing pixel-wise annotations for sounding objects in audible videos. It contains three subsets: AVSBench-object (Single-source subset, Multi-sources subset) and AVSBench-semantic (Semantic-labels subset). Accordingly, three settings are studied: 1) semi-supervised audio-visual segmentation with a single sound source; 2) fully-supervised audio-visual segmentation with multiple sound sources; and 3) fully-supervised audio-visual semantic segmentation. The first two settings require generating binary masks of sounding objects indicating the pixels corresponding to the audio, while the third setting further requires generating semantic maps indicating the object category. To deal with these problems, we propose a new baseline method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench dataset compare our approach to several existing methods for related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code can be found at https://github.com/OpenNLPLab/AVSBench.
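
The temporal pixel-wise audio-visual interaction described in the abstract can be pictured as an audio embedding being injected at every spatial location of the visual feature map. The sketch below is a minimal, hypothetical illustration of that idea; the shapes, module names, and the sigmoid gating are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PixelwiseAudioVisualFusion(nn.Module):
    """Sketch: inject an audio embedding into every pixel of a visual feature map."""

    def __init__(self, visual_dim, audio_dim, dim=256):
        super().__init__()
        self.q = nn.Linear(visual_dim, dim)   # per-pixel visual queries
        self.k = nn.Linear(audio_dim, dim)    # audio key
        self.v = nn.Linear(audio_dim, dim)    # audio value

    def forward(self, visual, audio):
        # visual: (B, C, H, W) frame features; audio: (B, D) audio embedding
        B, C, H, W = visual.shape
        q = self.q(visual.flatten(2).transpose(1, 2))            # (B, HW, dim)
        k = self.k(audio).unsqueeze(1)                           # (B, 1, dim)
        v = self.v(audio).unsqueeze(1)                           # (B, 1, dim)
        # Audio-pixel affinity gates how much audio semantics each pixel receives.
        gate = torch.sigmoid((q * k).sum(-1, keepdim=True) / q.shape[-1] ** 0.5)  # (B, HW, 1)
        fused = q + gate * v                                     # audio-guided pixel features
        return fused.transpose(1, 2).reshape(B, -1, H, W)        # (B, dim, H, W)

fusion = PixelwiseAudioVisualFusion(visual_dim=512, audio_dim=128)
out = fusion(torch.randn(2, 512, 28, 28), torch.randn(2, 128))   # (2, 256, 28, 28)
```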

Citations: 0
Learning Accurate Low-bit Quantization towards Efficient Computational Imaging
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-14 · DOI: 10.1007/s11263-024-02250-0
Sheng Xu, Yanjing Li, Chuanjian Liu, Baochang Zhang

Recent advances in deep neural networks (DNNs) promote low-level vision applications in real-world scenarios, e.g., image enhancement and dehazing. Nevertheless, DNN-based methods encounter challenges in terms of high computational and memory requirements, especially when deployed on real-world devices with limited resources. Quantization is an effective compression technique that significantly reduces computational and memory requirements by employing low-bit parameters and bit-wise operations. However, low-bit quantization for computational imaging (Q-Imaging) remains largely unexplored and usually suffers a significant performance drop compared with the real-valued counterparts. In this work, through empirical analysis, we identify that the main factors responsible for this performance drop are the large gradient estimation error of non-differentiable weight quantization methods and the information degeneration caused by activation quantization. To address these issues, we introduce a differentiable quantization search (DQS) method to learn the quantized weights and an information boosting module (IBM) for network activation quantization. Our DQS method allows us to treat the discrete weights in a quantized neural network as variables that can be searched. We achieve this by using a differentiable approach to accurately search for these weights. Specifically, each weight is represented as a probability distribution over a set of discrete values. During training, these probabilities are optimized, and the values with the highest probabilities are chosen to construct the desired quantized network. Moreover, our IBM module can rectify the activation distribution before quantization to maximize the self-information entropy, which retains the maximum information during the quantization process. Extensive experiments across a range of image processing tasks, including enhancement, super-resolution, denoising and dehazing, validate the effectiveness of our Q-Imaging, along with superior performance compared to a variety of state-of-the-art quantization methods. In particular, our Q-Imaging method also achieves strong generalization performance when used to build a detection network for the dark object detection task.
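
The weight-search idea, where each weight holds a distribution over a fixed set of discrete levels whose expectation is used during training and whose argmax is kept at deployment, can be sketched as follows. The 2-bit level grid and the class name are hypothetical; this is an illustration of the principle, not the paper's DQS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableQuantWeight(nn.Module):
    """Sketch of a differentiable search over discrete weight values.

    Each scalar weight holds logits over a fixed set of quantization levels
    (here a hypothetical 2-bit grid); training optimizes the logits, and the
    final network keeps the highest-probability level per weight.
    """
    def __init__(self, shape, levels=(-1.0, -0.5, 0.5, 1.0)):
        super().__init__()
        self.register_buffer("levels", torch.tensor(levels))
        self.logits = nn.Parameter(torch.zeros(*shape, len(levels)))

    def forward(self, hard=False):
        probs = F.softmax(self.logits, dim=-1)            # distribution per weight
        if hard:                                          # deployment: keep the argmax level
            idx = probs.argmax(dim=-1)
            return self.levels[idx]
        return (probs * self.levels).sum(dim=-1)          # training: differentiable expectation

# Usage: a quantized linear layer whose weights are searched rather than learned directly.
w = SearchableQuantWeight((8, 16))
x = torch.randn(4, 16)
y_soft = x @ w().t()            # soft weights during training
y_hard = x @ w(hard=True).t()   # discrete weights after the search
```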

Citations: 0
Towards Ultra High-Speed Hyperspectral Imaging by Integrating Compressive and Neuromorphic Sampling
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-14 · DOI: 10.1007/s11263-024-02236-y
Mengyue Geng, Lizhi Wang, Lin Zhu, Wei Zhang, Ruiqin Xiong, Yonghong Tian

Hyperspectral and high-speed imaging are both important for scene representation and understanding. However, simultaneously capturing both hyperspectral and high-speed data is still under-explored. In this work, we propose a high-speed hyperspectral imaging system by integrating compressive sensing sampling with bioinspired neuromorphic sampling. Our system includes a coded aperture snapshot spectral imager capturing moderate-speed hyperspectral measurement frames and a spike camera capturing high-speed grayscale dense spike streams. The two cameras provide complementary dual-modality data for reconstructing high-speed hyperspectral videos (HSV). To effectively synergize the two sampling mechanisms and obtain high-quality HSV, we propose a unified multi-modal reconstruction framework. The framework consists of a Spike Spectral Prior Network for spike-based information extraction and prior regularization, coupled with a dual-modality iterative optimization algorithm for reliable reconstruction. We finally build a hardware prototype to verify the effectiveness of our system and algorithm design. Experiments on both simulated and real data demonstrate the superiority of the proposed approach, where for the first time to our knowledge, high-speed HSV with 30 spectral bands can be captured at a frame rate of up to 20,000 FPS.
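
For readers unfamiliar with coded aperture snapshot spectral imaging, a generic CASSI-style forward model (each band masked by the coded aperture, dispersed by a per-band spatial shift, and summed onto a 2-D sensor) looks roughly like the sketch below. The actual optics and parameters of the paper's prototype may differ; this is only the textbook measurement model.

```python
import numpy as np

def cassi_measurement(cube, mask, step=1):
    """Generic coded-aperture snapshot (CASSI-style) forward model sketch.

    cube: (H, W, L) hyperspectral cube, mask: (H, W) binary coded aperture.
    Each band is masked, shifted by `step` pixels per band (dispersion), and
    summed into a single 2-D snapshot. Illustrative only, not the paper's optics.
    """
    H, W, L = cube.shape
    y = np.zeros((H, W + step * (L - 1)))
    for l in range(L):
        y[:, l * step:l * step + W] += cube[:, :, l] * mask
    return y

cube = np.random.rand(64, 64, 30)          # 30 spectral bands, as in the paper's setting
mask = (np.random.rand(64, 64) > 0.5).astype(float)
snapshot = cassi_measurement(cube, mask)   # (64, 64 + 29) compressed measurement
```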

Citations: 0
4Seasons: Benchmarking Visual SLAM and Long-Term Localization for Autonomous Driving in Challenging Conditions
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-13 · DOI: 10.1007/s11263-024-02230-4
Patrick Wenzel, Nan Yang, Rui Wang, Niclas Zeller, Daniel Cremers

In this paper, we present a novel visual SLAM and long-term localization benchmark for autonomous driving in challenging conditions based on the large-scale 4Seasons dataset. The proposed benchmark provides drastic appearance variations caused by seasonal changes and diverse weather and illumination conditions. While significant progress has been made in advancing visual SLAM on small-scale datasets with similar conditions, there is still a lack of unified benchmarks representative of real-world scenarios for autonomous driving. We introduce a new unified benchmark for jointly evaluating visual odometry, global place recognition, and map-based visual localization performance which is crucial to successfully enable autonomous driving in any condition. The data has been collected for more than one year, resulting in more than 300 km of recordings in nine different environments ranging from a multi-level parking garage to urban (including tunnels) to countryside and highway. We provide globally consistent reference poses with up to centimeter-level accuracy obtained from the fusion of direct stereo-inertial odometry with RTK GNSS. We evaluate the performance of several state-of-the-art visual odometry and visual localization baseline approaches on the benchmark and analyze their properties. The experimental results provide new insights into current approaches and show promising potential for future research. Our benchmark and evaluation protocols will be available at https://go.vision.in.tum.de/4seasons.
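
A common way to score visual odometry against centimeter-accurate reference poses like the ones described above is the absolute trajectory error after rigid alignment. The sketch below is a minimal, generic version of that metric, not the benchmark's official evaluation script.

```python
import numpy as np

def ate_rmse(est, ref):
    """Absolute trajectory error (RMSE) after rigid (Kabsch) alignment.

    est, ref: (N, 3) arrays of associated estimated and reference positions.
    Illustrative sketch only; real protocols also handle timestamps and scale.
    """
    mu_e, mu_r = est.mean(0), ref.mean(0)
    E, R_ = est - mu_e, ref - mu_r
    U, _, Vt = np.linalg.svd(E.T @ R_)                         # Kabsch alignment
    S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])    # reflection fix
    R = Vt.T @ S @ U.T
    aligned = (R @ E.T).T + mu_r
    return np.sqrt(np.mean(np.sum((aligned - ref) ** 2, axis=1)))

est = np.cumsum(np.random.randn(100, 3) * 0.1, axis=0)         # toy trajectory
ref = est + np.random.randn(100, 3) * 0.02                     # synthetic reference
print(f"ATE RMSE: {ate_rmse(est, ref):.3f} m")
```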

Citations: 0
Edge-Oriented Adversarial Attack for Deep Gait Recognition
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-10 · DOI: 10.1007/s11263-024-02225-1
Saihui Hou, Zengbin Wang, Man Zhang, Chunshui Cao, Xu Liu, Yongzhen Huang

Gait recognition is a non-intrusive method that captures unique walking patterns without subject cooperation, and it has emerged as a promising technique across various fields. Recent studies based on Deep Neural Networks (DNNs) have notably improved the performance; however, the potential vulnerability inherent in DNNs and their resistance to interference in practical gait recognition systems remain under-explored. To fill the gap, in this paper, we focus on imperceptible adversarial attack for deep gait recognition and propose an edge-oriented attack strategy tailored for silhouette-based approaches. Specifically, we make a pioneering attempt to explore the intrinsic characteristics of binary silhouettes, with a primary focus on injecting noise perturbations into the edge area. This simple yet effective solution enables sparse attack in both the spatial and temporal dimensions, which largely ensures imperceptibility while achieving a high success rate. In particular, our solution is built on a unified framework, allowing seamless switching between untargeted and targeted attack modes. Extensive experiments conducted on in-the-lab and in-the-wild benchmarks validate the effectiveness of our attack strategy and emphasize the necessity of studying adversarial attack and defense strategies in the near future.
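
The core idea, confining the perturbation to the edge band of a binary silhouette, can be illustrated with basic morphological operations. In the sketch below, random pixel flips stand in for the gradient-optimized perturbation, and all names and parameters are hypothetical.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def edge_band(silhouette, width=2):
    """Edge region of a binary silhouette: dilation minus erosion.

    silhouette: (H, W) array of {0, 1}. Returns a boolean mask of the band
    around the contour where a sparse, hard-to-notice perturbation can live.
    """
    sil = silhouette.astype(bool)
    return binary_dilation(sil, iterations=width) & ~binary_erosion(sil, iterations=width)

def perturb_edges(silhouette, flip_rate=0.05, rng=np.random.default_rng(0)):
    """Flip a small random fraction of pixels, restricted to the edge band.

    In a real attack the flips would be chosen by gradients of the gait model;
    random flips are used here only to show where the perturbation is allowed.
    """
    band = edge_band(silhouette)
    flips = band & (rng.random(silhouette.shape) < flip_rate)
    return np.where(flips, 1 - silhouette, silhouette)

sil = np.zeros((64, 44), dtype=np.uint8)
sil[10:54, 12:32] = 1                      # toy "silhouette"
adv = perturb_edges(sil)                   # differs from sil only near the contour
```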

Citations: 0
DLRA-Net: Deep Local Residual Attention Network with Contextual Refinement for Spectral Super-Resolution
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-09 · DOI: 10.1007/s11263-024-02238-w
Ahmed R. El-gabri, Hussein A. Aly, Tarek S. Ghoniemy, Mohamed A. Elshafey

Hyperspectral Images (HSIs) provide detailed scene insights using extensive spectral bands, which is crucial for material discrimination and earth observation, but they come at substantial cost and low spatial resolution. Recently, Convolutional Neural Networks (CNNs) have become a common choice for Spectral Super-Resolution (SSR) from Multispectral Images (MSIs). However, they often fail to simultaneously exploit the pixel-level noise degradation of MSIs and the complex contextual spatial-spectral characteristics of HSIs. In this paper, a Deep Local Residual Attention Network with Contextual Refinement (DLRA-Net) is proposed to integrate local low-rank spectral and global contextual priors for improved SSR. Specifically, SSR is unfolded into a Contextual-attention Refinement Module (CRM) and a Dual Local Residual Attention Module (DLRAM). CRM is proposed to adaptively learn complex contextual priors to guide the convolution layer weights for improved spatial restoration, while DLRAM captures deep refined texture details to enhance contextual prior representations for recovering HSIs. Moreover, a lateral fusion strategy is designed to integrate the obtained priors among DLRAMs for faster network convergence. Experimental results on natural-scene datasets with practical noise patterns confirm exceptional DLRA-Net performance with a relatively small model size. DLRA-Net demonstrates Maximum Relative Improvements (MRI) between 9.71 and 58.58% in Mean Relative Absolute Error (MRAE) with parameters reduced by between 52.18 and 85.85%. In addition, a practical RS-HSI dataset is generated for evaluation, showing MRI between 8.64 and 50.56% in MRAE. Furthermore, experiments with HSI classifiers indicate improved performance of reconstructed RS-HSIs compared to RS-MSIs, with MRI in Overall Accuracy (OA) between 7.10 and 15.27%. Lastly, a detailed ablation study assesses model complexity and runtime.
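
MRAE, the metric behind the relative-improvement numbers above, is commonly computed as the mean of the per-pixel absolute error divided by the ground-truth value. A small sketch, assuming this standard definition:

```python
import numpy as np

def mrae(pred, gt, eps=1e-8):
    """Mean Relative Absolute Error between a reconstructed HSI and ground truth.

    pred, gt: arrays of the same shape, e.g. (H, W, bands). Lower is better;
    eps guards against division by zero in dark pixels.
    """
    return np.mean(np.abs(pred - gt) / (np.abs(gt) + eps))

gt = np.random.rand(32, 32, 31)
pred = gt + np.random.randn(32, 32, 31) * 0.01
print(f"MRAE: {mrae(pred, gt):.4f}")
```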

Citations: 0
Mining Generalized Multi-timescale Inconsistency for Detecting Deepfake Videos
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-09 · DOI: 10.1007/s11263-024-02249-7
Yang Yu, Rongrong Ni, Siyuan Yang, Yu Ni, Yao Zhao, Alex C. Kot

Face forgery techniques have continuously evolved in recent years, leading to emerging security concerns in society. Existing detection methods have poor generalization ability due to the insufficient extraction of dynamic inconsistency cues on the one hand, and their inability to deal well with the gaps between forgery techniques on the other. To address this, we develop a new generalized framework that emphasizes extracting generalizable multi-timescale inconsistency cues. Firstly, we capture subtle dynamic inconsistency by magnifying the multipath dynamic inconsistency from the local-consecutive short-term temporal view. Secondly, inter-group graph learning is conducted to establish a sufficiently interactive long-term temporal view for capturing dynamic inconsistency comprehensively. Finally, we design a domain alignment module to directly reduce the distribution gaps by simultaneously disarranging inter- and intra-domain feature distributions, yielding a more generalized framework. Extensive experiments on six large-scale datasets and the designed generalization evaluation protocols show that our framework outperforms state-of-the-art deepfake video detection methods.
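
As a toy stand-in for the learned short-term inconsistency cues, one can compute frame differences at several temporal strides and inspect them jointly. The sketch below only illustrates the multi-timescale idea and is not the paper's module.

```python
import torch

def multi_timescale_differences(frames, strides=(1, 2, 4)):
    """Frame differences at several temporal strides as crude dynamic cues.

    frames: (T, C, H, W) video clip. Returns one (T - s, C, H, W) difference
    tensor per stride; larger strides expose slower inconsistencies.
    """
    return [frames[s:] - frames[:-s] for s in strides]

frames = torch.randn(16, 3, 112, 112)          # toy clip
diffs = multi_timescale_differences(frames)
print([d.shape for d in diffs])                 # three timescales of motion cues
```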

Citations: 0
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-08 · DOI: 10.1007/s11263-024-02223-3
Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy

We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion.
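
The mask-extraction step, aggregating cross-attention maps for an object prompt over layers and diffusion steps and then thresholding, can be sketched with plain tensors as below. Shapes and the fixed threshold are assumptions, and the edge-aware refinement mentioned in the abstract is omitted.

```python
import torch
import torch.nn.functional as F

def masks_from_cross_attention(attn_maps, out_size=(512, 512), thresh=0.4):
    """Aggregate cross-attention maps for one object prompt into a binary mask.

    attn_maps: list of (h, w) attention maps collected over layers and diffusion
    steps (hypothetical shapes). They are resized to a common resolution,
    averaged, min-max normalized, and thresholded.
    """
    resized = [F.interpolate(m[None, None], size=out_size, mode="bilinear",
                             align_corners=False)[0, 0] for m in attn_maps]
    agg = torch.stack(resized).mean(0)
    agg = (agg - agg.min()) / (agg.max() - agg.min() + 1e-8)
    return (agg > thresh).float()

maps = [torch.rand(16, 16), torch.rand(32, 32), torch.rand(64, 64)]  # toy attention maps
mask = masks_from_cross_attention(maps)                              # (512, 512) binary mask
```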

Citations: 0
Using Unreliable Pseudo-Labels for Label-Efficient Semantic Segmentation
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-08 · DOI: 10.1007/s11263-024-02229-x
Haochen Wang, Yuchao Wang, Yujun Shen, Junsong Fan, Yuxi Wang, Zhaoxiang Zhang

The crux of label-efficient semantic segmentation is to produce high-quality pseudo-labels to leverage a large amount of unlabeled or weakly labeled data. A common practice is to select the highly confident predictions as the pseudo-ground-truths for each pixel, but this leaves most pixels unused due to their unreliability. However, we argue that every pixel matters to the model training, even those unreliable and ambiguous pixels. Intuitively, an unreliable prediction may get confused among the top classes; however, it should still be confident about the pixel not belonging to the remaining classes. Hence, such a pixel can be convincingly treated as a negative key to those most unlikely categories. Therefore, we develop an effective pipeline to make sufficient use of unlabeled data. Concretely, we separate reliable and unreliable pixels via the entropy of predictions, push each unreliable pixel to a category-wise queue that consists of negative keys, and manage to train the model with all candidate pixels. Considering the training evolution, we adaptively adjust the threshold for the reliable-unreliable partition. Experimental results on various benchmarks and training settings demonstrate the superiority of our approach over the state-of-the-art alternatives.
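
A minimal sketch of the reliable/unreliable split and of mining negative classes for unreliable pixels is given below. The fixed entropy percentile replaces the paper's adaptive threshold, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def split_by_entropy(logits, keep_ratio=0.8, num_negatives=3):
    """Partition pixels by prediction entropy and mine negative classes.

    logits: (B, C, H, W) segmentation logits. Pixels below the entropy
    threshold are treated as reliable pseudo-labels; for the remaining pixels,
    the classes with the lowest predicted probability serve as negative keys.
    """
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)        # (B, H, W)
    thresh = torch.quantile(entropy.flatten(), keep_ratio)             # fixed percentile
    reliable = entropy <= thresh                                       # (B, H, W) bool
    pseudo_label = probs.argmax(dim=1)                                 # used where reliable
    # For unreliable pixels, the least-likely classes are almost surely wrong,
    # so they can be pushed to category-wise negative queues.
    negatives = probs.topk(num_negatives, dim=1, largest=False).indices  # (B, k, H, W)
    return pseudo_label, reliable, negatives

logits = torch.randn(2, 21, 64, 64)
pl, rel, neg = split_by_entropy(logits)
```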

Citations: 0
Group-Based Distinctive Image Captioning with Memory Difference Encoding and Attention
IF 19.5 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE · Pub Date: 2024-10-08 · DOI: 10.1007/s11263-024-02220-6
Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan

Recent advances in image captioning have focused on enhancing accuracy by substantially increasing the dataset and model size. While conventional captioning models exhibit high performance on established metrics such as BLEU, CIDEr, and SPICE, the capability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a similar image group (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely Group-based Differential Distinctive Captioning Method, which visually compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation for the image, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves state-of-the-art performance on distinctiveness while not excessively sacrificing accuracy. Moreover, the results of our user study are consistent with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.
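
One way to picture the group-based distinctiveness signal is to down-weight object features that closely match objects in the other images of the group. The sketch below computes such weights from cosine similarities; the shapes and function name are hypothetical, and this is not the GDMA module itself.

```python
import torch
import torch.nn.functional as F

def distinctiveness_weights(group_feats, target_idx):
    """Score how unique each object of one image is within its image group.

    group_feats: list of (N_i, D) object features, one tensor per image in the
    group. Objects of the target image that are highly similar to objects in
    the other images get low weights; unique ones get high weights, which can
    then bias attention during caption generation.
    """
    target = F.normalize(group_feats[target_idx], dim=-1)                      # (N, D)
    others = torch.cat([F.normalize(f, dim=-1)
                        for i, f in enumerate(group_feats) if i != target_idx])  # (M, D)
    max_sim = (target @ others.t()).max(dim=1).values                          # (N,)
    return torch.softmax(1.0 - max_sim, dim=0)                                 # (N,) weights

group = [torch.randn(5, 128), torch.randn(4, 128), torch.randn(6, 128)]
w = distinctiveness_weights(group, target_idx=0)   # weights for image 0's objects
```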

Citations: 0