
Computer Vision and Image Understanding: Latest Articles

Targeted adversarial attack on classic vision pipelines
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104140

Deep networks are susceptible to adversarial attacks. The end-to-end differentiability of deep networks provides an analytical formulation that has aided the proliferation of diverse adversarial attacks. In contrast, handcrafted pipelines (local feature matching, bag-of-words-based place recognition, and visual tracking) consist of intuitive approaches and perhaps lack an end-to-end formal description. In this work, we show that classic handcrafted pipelines are also susceptible to adversarial attacks.

We propose a novel targeted adversarial attack for multiple well-known handcrafted pipelines and datasets. Our attack is able to match an image with any given target image which can be completely different from the original image. Our approach manages to attack simple (image registration) as well as sophisticated multi-stage (place recognition (FAB-MAP), visual tracking (ORB-SLAM3)) pipelines. We outperform multiple baselines over different public datasets (Places, KITTI and HPatches).

Our analysis shows that, although these pipelines are vulnerable, achieving true imperceptibility is harder in the case of targeted attacks on handcrafted pipelines. To this end, we propose a stealthy attack where the noise is perceptible but appears benign. To assist the community in further examining the weaknesses of popular handcrafted pipelines, we release our code.
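As a rough illustration of what a targeted attack on a handcrafted matching pipeline involves, the sketch below perturbs a source image with a simple random-search (hill-climbing) loop so that OpenCV's ORB features match a chosen target image more often. The search strategy, per-step budget, match-count score, and file names are assumptions made for illustration; this is not the attack proposed in the paper.

```python
# Illustrative black-box targeted attack on an ORB feature-matching pipeline.
# Assumption: random-search perturbation and the match-count score are chosen
# for illustration only; they are not the attack proposed in the paper.
import cv2
import numpy as np

def match_count(img_a, img_b, orb, matcher):
    """Number of cross-checked ORB matches between two grayscale images."""
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    return len(matcher.match(des_a, des_b))

def targeted_attack(source, target, eps=8, iters=200, seed=0):
    """Perturb `source` (uint8 grayscale) so it matches `target` more strongly."""
    rng = np.random.default_rng(seed)
    orb = cv2.ORB_create(nfeatures=1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    adv = source.astype(np.int16)
    best = match_count(source, target, orb, matcher)
    for _ in range(iters):
        noise = rng.integers(-eps, eps + 1, size=source.shape, dtype=np.int16)
        candidate = np.clip(adv + noise, 0, 255)
        score = match_count(candidate.astype(np.uint8), target, orb, matcher)
        if score > best:          # keep perturbations that increase target matches
            adv, best = candidate, score
    return adv.astype(np.uint8), best

if __name__ == "__main__":
    src = cv2.imread("source.png", cv2.IMREAD_GRAYSCALE)   # hypothetical file names
    tgt = cv2.imread("target.png", cv2.IMREAD_GRAYSCALE)
    adv, score = targeted_attack(src, tgt)
    print("matches to target after attack:", score)
```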

Citations: 0
DBMHT: A double-branch multi-hypothesis transformer for 3D human pose estimation in video
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104147

The estimation of 3D human poses from monocular videos presents a significant challenge. Existing methods face the problems of depth ambiguity and self-occlusion. To overcome these problems, we propose a Double-Branch Multi-Hypothesis Transformer (DBMHT). In detail, we utilize a Double-Branch architecture to capture temporal and spatial information and generate multiple hypotheses. To merge these hypotheses, we adopt a lightweight module to integrate spatial and temporal representations. The DBMHT can not only capture spatial information from each joint in the human body and temporal information from each frame in the video but also merge multiple hypotheses that carry different spatio-temporal information. Comprehensive evaluation on two challenging datasets (i.e. Human3.6M and MPI-INF-3DHP) demonstrates the superior performance of DBMHT, marking it as a robust and efficient approach for accurate 3D HPE in dynamic scenarios. The results show that our model surpasses the state-of-the-art approach by 1.9% in MPJPE with ground-truth 2D keypoints as input.
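A minimal PyTorch sketch of the general double-branch, multi-hypothesis idea (a spatial branch attending over joints, a temporal branch attending over frames, several regressed hypotheses merged by a lightweight fusion layer) is given below; the dimensions, depths, and fusion rule are assumptions, not the published DBMHT architecture.

```python
# Minimal sketch of a double-branch, multi-hypothesis 3D pose lifter (not the
# published DBMHT; dimensions, depths, and the fusion rule are assumptions).
import torch
import torch.nn as nn

class DoubleBranchLifter(nn.Module):
    def __init__(self, n_joints=17, n_frames=27, dim=64, n_hyp=3):
        super().__init__()
        self.embed = nn.Linear(2, dim)                      # per-joint 2D -> feature
        spatial_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=2)   # attends over joints
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=2) # attends over frames
        self.heads = nn.ModuleList([nn.Linear(2 * dim, 3) for _ in range(n_hyp)])
        self.fuse = nn.Linear(n_hyp * 3, 3)                 # lightweight hypothesis merge

    def forward(self, x):                                   # x: (B, T, J, 2) 2D keypoints
        B, T, J, _ = x.shape
        f = self.embed(x)                                   # (B, T, J, dim)
        s = self.spatial(f.reshape(B * T, J, -1)).reshape(B, T, J, -1)
        t = self.temporal(f.permute(0, 2, 1, 3).reshape(B * J, T, -1))
        t = t.reshape(B, J, T, -1).permute(0, 2, 1, 3)      # back to (B, T, J, dim)
        feat = torch.cat([s, t], dim=-1)                    # merge spatial + temporal cues
        hyps = torch.stack([h(feat) for h in self.heads], dim=-2)  # (B, T, J, n_hyp, 3)
        return self.fuse(hyps.flatten(-2))                  # (B, T, J, 3) fused 3D pose

poses_2d = torch.randn(2, 27, 17, 2)
print(DoubleBranchLifter()(poses_2d).shape)                 # torch.Size([2, 27, 17, 3])
```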

Citations: 0
Continuous fake media detection: Adapting deepfake detectors to new generative techniques
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104143

Generative techniques continue to evolve at an impressively high rate, driven by the hype about these technologies. This rapid advancement severely limits the application of deepfake detectors, which, despite numerous efforts by the scientific community, struggle to achieve sufficiently robust performance against the ever-changing content. To address these limitations, in this paper, we propose an analysis of two continual learning techniques on a Short and a Long sequence of fake media. Both sequences include a complex and heterogeneous range of deepfakes (generated images and videos) from GANs, computer graphics techniques, and unknown sources. Our experiments show that continual learning could be important in mitigating the need for generalizability. In fact, we show that, although with some limitations, continual learning methods help to maintain good performance across the entire training sequence. For these techniques to work in a sufficiently robust way, however, it is necessary that the tasks in the sequence share similarities. In fact, according to our experiments, the order and similarity of the tasks can affect the performance of the models over time. To address this problem, we show that it is possible to group tasks based on their similarity. This small measure allows for a significant improvement even in longer sequences. This result suggests that continual techniques can be combined with the most promising detection methods, allowing them to catch up with the latest generative techniques. In addition to this, we propose an overview of how this learning approach can be integrated into a deepfake detection pipeline for continuous integration and continuous deployment (CI/CD). This makes it possible to keep track of different sources, such as social networks, new generative tools, or third-party datasets, and, through the integration of continual learning, allows constant maintenance of the detectors.
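The study evaluates existing continual-learning techniques rather than introducing a new one; as a generic illustration of the setting, the sketch below fine-tunes a small real/fake classifier over a sequence of tasks while replaying examples from a memory buffer carried across tasks. The buffer size, model, and sampling rule are assumptions, not the paper's setup.

```python
# Generic replay-based continual learning loop for a real/fake classifier.
# Assumption: a plain replay buffer and a tiny CNN stand in for whatever
# detector and continual method one actually uses; not the paper's setup.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
buffer, buffer_cap = [], 512                      # replay memory shared across tasks

def train_on_task(task_batches, replay_ratio=0.5):
    """task_batches: iterable of (images, labels) for one generator/'task'."""
    for x, y in task_batches:
        if buffer and random.random() < replay_ratio:
            bx, by = random.choice(buffer)        # mix in one stored past batch
            x, y = torch.cat([x, bx]), torch.cat([y, by])
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if len(buffer) < buffer_cap:
            buffer.append((x.detach(), y.detach()))

# Sequence of tasks, e.g. one per generative technique; random stand-in data here.
for task_id in range(3):
    fake_task = [(torch.randn(8, 3, 64, 64), torch.randint(0, 2, (8,))) for _ in range(5)]
    train_on_task(fake_task)
    print(f"finished task {task_id}, buffer size {len(buffer)}")
```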

Citations: 0
Agglomerator++: Interpretable part-whole hierarchies and latent space representations in neural networks
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104159

Deep neural networks achieve outstanding results in a large variety of tasks, often outperforming human experts. However, a known limitation of current neural architectures is that it is difficult to understand and interpret the network's response to a given input. This is directly related to the huge number of variables and the associated non-linearities of neural models, which are often used as black boxes. This lack of transparency, particularly in crucial areas like autonomous driving, security, and healthcare, can trigger skepticism and limit trust, despite the networks' high performance. In this work, we aim to advance interpretability in neural networks. We present Agglomerator++, a framework capable of providing a representation of part-whole hierarchies from visual cues and organizing the input distribution to match the conceptual-semantic hierarchical structure between classes. We evaluate our method on common datasets, such as SmallNORB, MNIST, FashionMNIST, CIFAR-10, and CIFAR-100, showing that our solution delivers a more interpretable model compared to other state-of-the-art approaches. Our code is available at https://mmlab-cv.github.io/Agglomeratorplusplus/.
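Part-whole hierarchies of this kind are often described as columns of level embeddings that reach a consensus between part and whole representations; the numpy sketch below shows one such consensus step, where each level is nudged toward a mix of its bottom-up, top-down, and lateral (attention-averaged) signals. The update weights and projections are placeholders for illustration, not the actual Agglomerator++ rule.

```python
# One consensus update over a grid of columns with L levels per column
# (a conceptual sketch; weights and projections are placeholders, not the
# actual Agglomerator++ update rule).
import numpy as np

rng = np.random.default_rng(0)
H = W = 4; L = 3; D = 8                     # grid size, levels per column, embedding dim
cols = rng.normal(size=(H, W, L, D))        # level embeddings for every location
W_up = rng.normal(size=(L, D, D)) * 0.1     # bottom-up "part -> whole" projections
W_dn = rng.normal(size=(L, D, D)) * 0.1     # top-down  "whole -> part" projections

def consensus_step(cols, w_prev=0.4, w_up=0.2, w_dn=0.2, w_lat=0.2):
    new = np.zeros_like(cols)
    for l in range(L):
        prev = cols[:, :, l]
        up = cols[:, :, l - 1] @ W_up[l] if l > 0 else prev        # from the level below
        dn = cols[:, :, l + 1] @ W_dn[l] if l < L - 1 else prev    # from the level above
        # lateral term: attention-weighted average of the same level at all locations
        flat = prev.reshape(-1, D)
        att = np.exp(flat @ flat.T / np.sqrt(D))
        att /= att.sum(axis=1, keepdims=True)
        lat = (att @ flat).reshape(H, W, D)
        new[:, :, l] = w_prev * prev + w_up * up + w_dn * dn + w_lat * lat
    return new

for step in range(3):
    cols = consensus_step(cols)
print("column embeddings after consensus:", cols.shape)    # (4, 4, 3, 8)
```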

Citations: 0
Pyramid transformer-based triplet hashing for robust visual place recognition
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104167

Deep hashing is being used to approximate nearest-neighbor search for large-scale image recognition problems; however, CNN architectures have dominated similar applications. In this study, we present a Pyramid Transformer-based Triplet Hashing architecture to handle large-scale place recognition challenges, leveraging the capabilities of the Vision Transformer (ViT). For feature representation, we create a Siamese Pyramid Transformer backbone. We present a multi-scale feature aggregation technique to learn discriminative, scale-invariant features. In addition, we observe that the resulting binary codes are sub-optimal for place recognition. To overcome this issue, we use a self-restraint triplet-loss deep learning network to create compact hash codes, further increasing recognition accuracy. To the best of our knowledge, this is the first study to use a triplet-loss deep learning network to handle the deep hashing learning problem. We conduct extensive experiments on four challenging place recognition datasets: KITTI, Nordland, VPRICE, and EuRoC. The experimental findings show that the proposed technique achieves state-of-the-art performance on large-scale visual place recognition challenges.
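As a small illustration of a triplet-hashing objective of the kind described, the PyTorch snippet below relaxes binary codes with tanh and applies a margin-based triplet loss plus a quantization penalty; the margin, code length, and penalty weight are assumptions rather than the paper's exact loss.

```python
# Triplet loss on tanh-relaxed hash codes (illustrative; margin, code length and
# the quantization penalty are assumptions, not the paper's exact objective).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashHead(nn.Module):
    def __init__(self, feat_dim=512, code_len=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_len)

    def forward(self, feats):
        return torch.tanh(self.proj(feats))      # relaxed codes in (-1, 1)

def triplet_hash_loss(anchor, positive, negative, margin=0.5, quant_weight=0.1):
    trip = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # push relaxed codes toward binary {-1, +1} so sign() loses little information
    quant = ((anchor.abs() - 1) ** 2).mean()
    return trip + quant_weight * quant

head = HashHead()
a, p, n = (torch.randn(4, 512) for _ in range(3))    # backbone features (assumed given)
loss = triplet_hash_loss(head(a), head(p), head(n))
loss.backward()
binary_codes = torch.sign(head(a)).detach()           # codes used at retrieval time
print(loss.item(), binary_codes.shape)
```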

Citations: 0
Bypass network for semantics driven image paragraph captioning
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-06 | DOI: 10.1016/j.cviu.2024.104154

Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model the coherence through a topic transition that dynamically infers a topic vector from preceding sentences. However, these methods still suffer from immediate or delayed repetitions in generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending to pertinent visual regions; and (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models the semantics and the linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e. a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies an attention mechanism over regional features to acquire the next topic vector, which reduces immediate repetition by excluding linguistic syntax from the topic query. The latter decodes the topic vector and the preceding syntax state to produce the following sentence. To further reduce delayed repetition in generated paragraphs, we devise a replacement-based reward for REINFORCE training. Comprehensive experiments on the widely used benchmark demonstrate the superiority of the proposed model over the state of the art in coherence while maintaining high accuracy.
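The topic-transition step (the previous semantic vector querying region features to produce the next topic vector) can be sketched as a single attention operation, as below; the dimensions and projections are assumptions, not the paper's exact module.

```python
# Sketch of a topic-transition step: the previous semantic state queries region
# features to produce the next topic vector (dimensions are assumptions).
import torch
import torch.nn as nn

class TopicTransition(nn.Module):
    def __init__(self, sem_dim=512, region_dim=2048, topic_dim=512):
        super().__init__()
        self.q = nn.Linear(sem_dim, topic_dim)
        self.k = nn.Linear(region_dim, topic_dim)
        self.v = nn.Linear(region_dim, topic_dim)

    def forward(self, prev_semantic, regions):
        # prev_semantic: (B, sem_dim) summary of sentences generated so far
        # regions:       (B, R, region_dim) detected region features
        q = self.q(prev_semantic).unsqueeze(1)                 # (B, 1, topic_dim)
        k, v = self.k(regions), self.v(regions)                # (B, R, topic_dim)
        att = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        topic = (att @ v).squeeze(1)                           # (B, topic_dim)
        return topic                                           # fed to the sentence decoder

module = TopicTransition()
topic = module(torch.randn(2, 512), torch.randn(2, 36, 2048))
print(topic.shape)                                             # torch.Size([2, 512])
```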

Citations: 0
Class Probability Space Regularization for semi-supervised semantic segmentation
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-05 | DOI: 10.1016/j.cviu.2024.104146

Semantic segmentation achieves fine-grained scene parsing in any scenario, making it one of the key research directions to facilitate the development of human visual attention mechanisms. Recent advancements in semi-supervised semantic segmentation have attracted considerable attention due to their potential in leveraging unlabeled data. However, existing methods only focus on exploring the knowledge of unlabeled pixels with high certainty prediction. Their insufficient mining of low certainty regions of unlabeled data results in a significant loss of supervisory information. Therefore, this paper proposes the Class Probability Space Regularization (CPSR) approach to further exploit the potential of each unlabeled pixel. Specifically, we first design a class knowledge reshaping module to regularize the probability space of low certainty pixels, thereby transforming them into high certainty ones for supervised training. Furthermore, we propose a tail probability suppression module to suppress the probabilities of tailed classes, which facilitates the network to learn more discriminative information from the class probability space. Extensive experiments conducted on the PASCAL VOC2012 and Cityscapes datasets prove that our method achieves state-of-the-art performance without introducing much computational overhead. Code is available at https://github.com/MKSAQW/CPSR.
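The two operations named in the abstract, reshaping the probability space of low-certainty pixels and suppressing tail-class probabilities before pseudo-labeling, can be illustrated roughly as follows; the temperature sharpening and top-k cutoff are stand-ins for the paper's actual modules.

```python
# Rough illustration of regularizing per-pixel class probabilities before
# pseudo-labeling: sharpen low-certainty pixels and zero out tail classes.
# The temperature and top-k rule are stand-ins, not the CPSR modules.
import torch
import torch.nn.functional as F

def regularize_probs(probs, conf_thresh=0.8, temperature=0.5, keep_top=3):
    """probs: (B, C, H, W) softmax output of a teacher/segmentation network."""
    conf, _ = probs.max(dim=1, keepdim=True)                    # per-pixel certainty
    sharpened = F.softmax(torch.log(probs.clamp_min(1e-8)) / temperature, dim=1)
    probs = torch.where(conf < conf_thresh, sharpened, probs)   # reshape uncertain pixels
    # suppress tailed classes: keep only the top-k classes per pixel, renormalize
    topk_vals, _ = probs.topk(keep_top, dim=1)
    cutoff = topk_vals[:, -1:, :, :]
    probs = torch.where(probs >= cutoff, probs, torch.zeros_like(probs))
    probs = probs / probs.sum(dim=1, keepdim=True)
    pseudo_label = probs.argmax(dim=1)                          # (B, H, W) hard labels
    return probs, pseudo_label

logits = torch.randn(2, 21, 64, 64)                             # e.g. 21 VOC classes
probs, labels = regularize_probs(F.softmax(logits, dim=1))
print(labels.shape, probs.sum(dim=1).mean().item())             # (2, 64, 64), ~1.0
```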

Citations: 0
Embedding AI ethics into the design and use of computer vision technology for consumer's behaviour understanding
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-04 | DOI: 10.1016/j.cviu.2024.104142

Artificial Intelligence (AI) techniques are becoming increasingly sophisticated, showing the potential to deeply understand and predict consumer behaviour in ways that could boost the retail sector; however, the retail-specific considerations underpinning their deployment have been poorly explored to date. This paper explores the application of AI technologies in the retail sector, focusing on their potential to enhance decision-making processes by preventing the major ethical risks inherent to them, such as the propagation of bias and the systems' lack of explainability. Drawing on recent literature on AI ethics, this study proposes a methodological path for the design and development of trustworthy, unbiased, and more explainable AI systems in the retail sector. The framework is grounded in European Union (EU) AI ethics principles and addresses the specific nuances of retail applications. To do this, we first examine the VRAI framework, a deep learning model used to analyse shopper interactions, people counting, and re-identification, to highlight the critical need for transparency and fairness in AI operations. Second, the paper proposes actionable strategies for integrating high-level ethical guidelines into practical settings and, in particular, for mitigating biases that lead to unfair outcomes in AI systems and improving their explainability. By doing so, the paper aims to show the key added value of embedding AI ethics requirements into AI practices and computer vision technology to truly promote technically and ethically robust AI in the retail domain.

Citations: 0
Acoustic features analysis for explainable machine learning-based audio spoofing detection
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-04 | DOI: 10.1016/j.cviu.2024.104145

The rapid evolution of synthetic voice generation and audio manipulation technologies poses significant challenges, raising societal and security concerns due to the risks of impersonation and the proliferation of audio deepfakes. This study introduces a lightweight machine learning (ML)-based framework designed to effectively distinguish between genuine and spoofed audio recordings. Departing from conventional deep learning (DL) approaches, which mainly rely on image-based spectrogram features or learning-based audio features, the proposed method utilizes a diverse set of hand-crafted audio features – such as spectral, temporal, chroma, and frequency-domain features – to enhance the accuracy of deepfake audio content detection. Through extensive evaluation and experiments on three well-known datasets, ASVSpoof2019, FakeAVCelebV2, and an In-The-Wild database, the proposed solution demonstrates robust performance and a high degree of generalization compared to state-of-the-art methods. In particular, our method achieved 89% accuracy on ASVSpoof2019, 94.5% on FakeAVCelebV2, and 94.67% on the In-The-Wild database. Additionally, the experiments performed on explainability techniques clarify the decision-making processes within ML models, enhancing transparency and identifying crucial features essential for audio deepfake detection.
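A lightweight hand-crafted-feature pipeline of the kind described can look like the sketch below: per-clip spectral, chroma, and temporal statistics pooled over time and fed to a classical classifier. The specific feature set, pooling, and random-forest choice are assumptions, not the paper's configuration.

```python
# Hand-crafted audio features + a lightweight classifier for real/spoofed audio.
# Feature list, pooling, and classifier choice are illustrative assumptions.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

SR = 16000

def clip_features(y, sr=SR):
    """Summarize one waveform with frame-level features pooled over time."""
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),          # spectral envelope
        librosa.feature.chroma_stft(y=y, sr=sr),               # chroma
        librosa.feature.spectral_centroid(y=y, sr=sr),         # spectral statistics
        librosa.feature.spectral_rolloff(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),                 # temporal statistics
    ]
    # pool each frame-level feature by its mean and std over time
    return np.concatenate([np.hstack([f.mean(axis=1), f.std(axis=1)]) for f in feats])

# Stand-in data: random 1-second clips in place of genuine/spoofed recordings
# from a corpus such as ASVspoof2019; labels 0 = genuine, 1 = spoofed.
rng = np.random.default_rng(0)
X = np.stack([clip_features(rng.normal(size=SR).astype(np.float32)) for _ in range(16)])
y = rng.integers(0, 2, size=16)
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
print("feature dimension:", X.shape[1], "train accuracy:", clf.score(X, y))
```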

Citations: 0
CRML-Net: Cross-Modal Reasoning and Multi-Task Learning Network for tooth image segmentation
IF 4.3 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-09-02 | DOI: 10.1016/j.cviu.2024.104138

Data from a single modality may suffer from noise, low contrast, or other imaging limitations that affect the model’s accuracy. Furthermore, due to the limited amount of data, most models trained on single-modality data tend to overfit the training set and perform poorly on out-of-domain data. Therefore, in this paper, we propose a network named Cross-Modal Reasoning and Multi-Task Learning Network (CRML-Net), which combines cross-modal reasoning and multi-task learning, aiming to leverage the complementary information between different modalities and tasks to enhance the model’s generalization ability and accuracy. Specifically, CRML-Net consists of two stages. In the first stage, our network extracts a new morphological information modality from the original image and then performs cross-modal fusion with the original modality image, aiming to leverage the morphological information to enhance the model’s robustness to out-of-domain datasets. In the second stage, based on the output of the previous stage, we introduce a multi-task learning mechanism, aiming to improve the model’s performance on unseen data by sharing surface detail information from auxiliary tasks. We validated our method on a publicly available tooth cone beam computed tomography dataset. Our evaluation demonstrates that our method outperforms state-of-the-art approaches.
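In the spirit of the two stages outlined (deriving a morphological modality from the image, fusing it with the original, and supervising an auxiliary surface-detail task alongside segmentation), the sketch below builds a toy two-encoder network with a shared fused feature and two heads; the morphological gradient, the fusion layer, and the auxiliary edge head are assumptions, not CRML-Net's actual design.

```python
# Toy two-encoder fusion with segmentation + auxiliary edge heads, in the spirit
# of the described two-stage design (all module choices are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

def morphological_gradient(x, k=3):
    """Dilation minus erosion, used here as a derived 'morphology' modality."""
    pad = k // 2
    dil = F.max_pool2d(x, k, stride=1, padding=pad)
    ero = -F.max_pool2d(-x, k, stride=1, padding=pad)
    return dil - ero

class TwoBranchSegNet(nn.Module):
    def __init__(self, ch=16, n_classes=2):
        super().__init__()
        self.enc_img = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.enc_morph = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * ch, ch, 1)              # cross-modal fusion
        self.seg_head = nn.Conv2d(ch, n_classes, 1)       # main task: segmentation
        self.edge_head = nn.Conv2d(ch, 1, 1)              # auxiliary task: surface detail

    def forward(self, img):
        morph = morphological_gradient(img)
        f = self.fuse(torch.cat([self.enc_img(img), self.enc_morph(morph)], dim=1))
        return self.seg_head(f), self.edge_head(f)

net = TwoBranchSegNet()
seg_logits, edge_logits = net(torch.randn(1, 1, 96, 96))
print(seg_logits.shape, edge_logits.shape)   # (1, 2, 96, 96) (1, 1, 96, 96)
```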

Citations: 0