
Latest Publications in Computer Vision and Image Understanding

Subtle signals: Video-based detection of infant non-nutritive sucking as a neurodevelopmental cue
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-19 | DOI: 10.1016/j.cviu.2024.104081

Non-nutritive sucking (NNS), which refers to the act of sucking on a pacifier, finger, or similar object without nutrient intake, plays a crucial role in assessing healthy early development. In the case of preterm infants, NNS behavior is a key component in determining their readiness for feeding. In older infants, the characteristics of NNS behavior offer valuable insights into neural and motor development. Additionally, NNS activity has been proposed as a potential safeguard against sudden infant death syndrome (SIDS). However, the clinical application of NNS assessment is currently hindered by labor-intensive and subjective finger-in-mouth evaluations. Consequently, researchers often resort to expensive pressure transducers for objective NNS signal measurement. To enhance the accessibility and reliability of NNS signal monitoring for both clinicians and researchers, we introduce a vision-based algorithm designed for non-contact detection of NNS activity using baby monitor footage in natural settings. Our approach involves a comprehensive exploration of optical flow and temporal convolutional networks, enabling the detection and amplification of subtle infant-sucking signals. We successfully classify short video clips of uniform length into NNS and non-NNS periods. Furthermore, we investigate manual and learning-based techniques to piece together local classification results, facilitating the segmentation of longer mixed-activity videos into NNS and non-NNS segments of varying duration. Our research introduces two novel datasets of annotated infant videos, including one sourced from our clinical study featuring 18 infant subjects and 183 h of overnight baby monitor footage. Additionally, we incorporate a second, shorter dataset obtained from publicly available YouTube videos. Our NNS action recognition algorithm achieves an impressive 95.8% accuracy in binary classification, based on 960 2.5-s balanced NNS versus non-NNS clips from our clinical dataset. We also present results for a subset of clips featuring challenging video conditions. Moreover, our NNS action segmentation algorithm achieves an average precision of 93.5% and an average recall of 92.9% across 30 heterogeneous 60-s clips from our clinical dataset.
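
As a rough illustration of the clip-level classification step described above, the sketch below shows a small temporal convolutional network that maps a fixed-length sequence of per-frame motion features (e.g., optical-flow statistics around the mouth) to NNS/non-NNS logits. The layer sizes, feature dimensionality, and frame rate are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' implementation): a small temporal
# convolutional network (TCN) that classifies a fixed-length clip of
# per-frame motion features as NNS vs. non-NNS. Feature extraction
# (e.g., optical-flow magnitude near the mouth) is assumed to happen upstream.
import torch
import torch.nn as nn

class TinyTCNClassifier(nn.Module):
    def __init__(self, in_channels=2, hidden=64, levels=3, kernel_size=3):
        super().__init__()
        layers = []
        c_in = in_channels
        for i in range(levels):
            dilation = 2 ** i  # exponentially growing temporal receptive field
            layers += [
                nn.Conv1d(c_in, hidden, kernel_size,
                          padding=dilation * (kernel_size - 1) // 2,
                          dilation=dilation),
                nn.BatchNorm1d(hidden),
                nn.ReLU(),
            ]
            c_in = hidden
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Linear(hidden, 2)  # NNS vs. non-NNS logits

    def forward(self, x):          # x: (batch, channels, frames)
        h = self.tcn(x)            # (batch, hidden, frames)
        h = h.mean(dim=-1)         # temporal average pooling
        return self.head(h)

# Example: a 2.5-s clip at an assumed 30 fps -> 75 frames of 2-D motion features.
clip = torch.randn(8, 2, 75)
logits = TinyTCNClassifier()(clip)  # (8, 2)
```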

Citations: 0
LLAFN-Generator: Learnable linear-attention with fast-normalization for large-scale image captioning
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-18 | DOI: 10.1016/j.cviu.2024.104088

Recently, although the Transformer has found widespread application in computer vision, the quadratic complexity of its Self-Attention hinders its use in large-scale image captioning tasks. Therefore, in this paper, we propose a Learnable Linear-Attention with Fast-Normalization for Large-Scale Image Captioning (dubbed LLAFN-Generator). Firstly, it introduces a Learnable Linear-Attention (LLA) module to handle attention-weight learning for large-scale images; the module is implemented with just two linear layers, which greatly reduces the computational complexity. Meanwhile, a Fast-Normalization (FN) method is employed in the Learnable Linear-Attention in place of the original Softmax function to improve computational speed. Additionally, a feature enhancement module is used to incorporate shallow, fine-grained information and strengthen the feature representation of the model. Finally, extensive experiments on the MS COCO dataset show that, for models of the same size, the computational complexity is reduced by 30% and the parameter count by 20%, while the performance metrics BLEU_1 and CIDEr increase by 1.2% and 3.6%, respectively.
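
To make the idea concrete, here is a minimal sketch of a linear-attention layer in the general spirit the abstract describes: queries and keys come from learned linear maps, and softmax is replaced by a cheaper feature-map normalization, so the cost grows linearly with the number of tokens. The ReLU feature map, layer sizes, and normalization constant are assumptions for illustration, not the paper's exact LLA/FN design.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Attention with linear complexity in sequence length: softmax is
    replaced by a non-negative feature map plus a simple normalization."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.eps = eps

    def forward(self, x):                      # x: (batch, tokens, dim)
        q = torch.relu(self.to_q(x))           # non-negative feature maps
        k = torch.relu(self.to_k(x))
        v = self.to_v(x)
        kv = torch.einsum('bnd,bne->bde', k, v)                 # per-batch (dim, dim) summary
        z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + self.eps)
        out = torch.einsum('bnd,bde,bn->bne', q, kv, z)         # normalized attention output
        return out

x = torch.randn(2, 196, 256)     # e.g., 14x14 visual tokens
y = LinearAttention(256)(x)      # same shape, O(N) in token count
```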

Citations: 0
SHOWMe: Robust object-agnostic hand-object 3D reconstruction from RGB video
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-15 | DOI: 10.1016/j.cviu.2024.104073

In this paper, we tackle the problem of detailed hand-object 3D reconstruction from monocular video with unknown objects, for applications where the required accuracy and level of detail is important, e.g. object hand-over in human–robot collaboration, or manipulation and contact point analysis. While the recent literature on this topic is promising, the accuracy and generalization abilities of existing methods are still lacking. This is due to several limitations, such as the assumption of known object class or model for a small number of instances, or over-reliance on off-the-shelf keypoint and structure-from-motion methods for object-relative viewpoint estimation, prone to complete failure with previously unobserved, poorly textured objects or hand-object occlusions. To address previous method shortcomings, we present a 2-stage pipeline superseding state-of-the-art (SotA) performance on several metrics. First, we robustly retrieve viewpoints relying on a learned pairwise camera pose estimator trainable with a low data regime, followed by a globalized Shonan pose averaging. Second, we simultaneously estimate detailed 3D hand-object shapes and refine camera poses using a differential renderer-based optimizer. To better assess the out-of-distribution abilities of existing methods, and to showcase our methodological contributions, we introduce the new SHOWMe benchmark dataset with 96 sequences annotated with poses, millimetric textured 3D shape scans, and parametric hand models, introducing new object and hand diversity. Remarkably, we show that our method is able to reconstruct 100% of these sequences as opposed to SotA Structure-from-Motion (SfM) or hand-keypoint-based pipelines, and obtains reconstructions of equivalent or better precision when existing methods do succeed in providing a result. We hope these contributions lead to further research under harder input assumptions. The dataset can be downloaded at https://download.europe.naverlabs.com/showme.
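
The global pose-averaging step mentioned above can be illustrated with a much simpler stand-in: a chordal (L2) average that fuses several noisy estimates of one camera rotation by averaging the matrices and projecting back onto SO(3). This is only a toy substitute for the Shonan averaging used in the paper; the function names and noise model below are assumptions.

```python
import numpy as np

def project_to_SO3(M):
    """Project a 3x3 matrix onto the closest rotation (Frobenius norm)."""
    U, _, Vt = np.linalg.svd(M)
    R = U @ Vt
    if np.linalg.det(R) < 0:          # ensure det(R) = +1
        U[:, -1] *= -1
        R = U @ Vt
    return R

def chordal_mean_rotation(rotations):
    """Single-rotation chordal (L2) average: mean of the matrices,
    projected back onto SO(3)."""
    return project_to_SO3(np.mean(rotations, axis=0))

# Example: fuse several noisy estimates of one camera orientation.
def random_rotation(rng):
    return project_to_SO3(rng.normal(size=(3, 3)))

rng = np.random.default_rng(0)
R_true = random_rotation(rng)
noisy = [project_to_SO3(R_true + 0.05 * rng.normal(size=(3, 3))) for _ in range(10)]
R_avg = chordal_mean_rotation(noisy)
print(np.linalg.norm(R_avg - R_true))   # small residual
```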

Citations: 0
Artifact feature purification for cross-domain detection of AI-generated images
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-14 | DOI: 10.1016/j.cviu.2024.104078

In the era of AIGC, the fast development of visual content generation technologies, such as diffusion models, brings potential security risks to our society. Existing generated image detection methods suffer from performance drops when faced with out-of-domain generators and image scenes. To relieve this problem, we propose Artifact Purification Network (APN) to facilitate the artifact extraction from generated images through explicit and implicit purification processes. For the explicit one, a suspicious frequency-band proposal method and a spatial feature decomposition method are proposed to extract artifact-related features. For the implicit one, a training strategy based on mutual information estimation is proposed to further purify the artifact-related features. The experiments are conducted in two settings. Firstly, we perform a cross-generator evaluation, wherein detectors trained using data from one generator are evaluated on data generated by other generators. Secondly, we conduct a cross-scene evaluation, wherein detectors trained for a specific domain of content (e.g., ImageNet) are assessed on data collected from another domain (e.g., LSUN-Bedroom). Results show that for cross-generator detection, the average accuracy of APN is 5.6%∼16.4% higher than the previous 11 methods on the GenImage dataset and 1.7%∼50.1% higher on the DiffusionForensics dataset. For cross-scene detection, APN maintains its high performance. Via visualization analysis, we find that the proposed method can extract diverse forgery patterns and condense the forgery information diluted in unrelated features. We also find that the artifact features APN focuses on across generators and scenes are global and diverse. The code will be available at https://github.com/RichardSunnyMeng/APN-official-codes.
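
As a sketch of what suspicious frequency-band extraction can look like in practice, the snippet below isolates a mid-to-high-frequency ring of the 2-D spectrum with an FFT mask and transforms it back to the spatial domain. The band edges and mask shape are illustrative assumptions, not the learned band proposal described in the paper.

```python
import torch

def band_mask(h, w, r_low, r_high):
    """Binary ring mask in the centered 2-D frequency plane, selecting
    radii in [r_low, r_high) after normalizing coordinates to [-1, 1]."""
    ys = torch.linspace(-1, 1, h).view(-1, 1).expand(h, w)
    xs = torch.linspace(-1, 1, w).view(1, -1).expand(h, w)
    r = torch.sqrt(xs ** 2 + ys ** 2)
    return ((r >= r_low) & (r < r_high)).float()

def extract_band(img, r_low=0.25, r_high=0.75):
    """Keep only a mid/high-frequency band of the image, where generation
    artifacts are often assumed to concentrate."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    mask = band_mask(img.shape[-2], img.shape[-1], r_low, r_high)
    filtered = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)))
    return filtered.real

x = torch.randn(1, 3, 224, 224)       # stand-in image tensor
band = extract_band(x)                # same shape, band-limited content
```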

Citations: 0
Advancing Image Generation with Denoising Diffusion Probabilistic Model and ConvNeXt-V2: A novel approach for enhanced diversity and quality
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-14 | DOI: 10.1016/j.cviu.2024.104077

In the rapidly evolving domain of image generation, the availability of sufficient data is crucial for effective model training. However, obtaining a large dataset is often challenging. Medical imaging, industrial monitoring, and self-driving cars are among the applications that require high-fidelity image generation from limited or single data points. The paper proposes a novel approach for increasing the diversity of images generated from a single input image by combining a Denoising Diffusion Probabilistic Model (DDPM) with the ConvNeXt-V2 architecture. This technique addresses the issue of limited data availability by utilizing single images using the BSD and Places365 datasets, significantly increasing the ability of the model through different conditions. The research greatly enhances the image quality by including Global Response Normalization (GRN) and Sigmoid-Weighted Linear Units (SiLU) in the DDPM. In-depth analyses and comparisons with the existing State-of-the-art (SOTA) models highlight the model’s effectiveness, which shows higher experimental results. Achievements include a Pixel Diversity score of 0.87±0.1, an LPIPS Diversity score of 0.42±0.03, and a SIFID for Patch Distribution of 0.046±0.02, along with notable NIQE and RECO scores. These findings indicate the exceptional ability of the model to generate a wide range of high-quality images, exhibiting significant advancement over existing State-of-the-art models in the field of image generation.
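
Global Response Normalization is a small, well-defined layer (introduced with ConvNeXt-V2), so a sketch of it together with a SiLU activation is given below; how exactly the paper wires these into the DDPM backbone is not shown here, and the feature shapes are assumptions.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (as in ConvNeXt-V2): per-channel global
    aggregation, divisive normalization across channels, and a learnable
    residual calibration."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):                                        # x: (N, H, W, C)
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)        # (N, 1, 1, C) global response
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)     # relative channel importance
        return self.gamma * (x * nx) + self.beta + x             # calibrated residual output

feat = torch.randn(4, 32, 32, 128)               # channels-last features (assumed shape)
out = nn.Sequential(GRN(128), nn.SiLU())(feat)   # GRN followed by SiLU activation
```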

Citations: 0
EnsCLR: Unsupervised skeleton-based action recognition via ensemble contrastive learning of representation
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-14 | DOI: 10.1016/j.cviu.2024.104076

Skeleton-based action recognition is a key research area in video understanding, benefiting from the compact and efficient motion information that skeletons provide. To relieve the burden of expensive and laborious data annotation, unsupervised approaches, particularly contrastive learning, have been widely employed to extract action representations from unlabeled data. In this paper, we propose an Ensemble framework for Contrastive Learning of Representation (EnsCLR) to perform unsupervised skeleton-based action recognition. Concretely, a Queue Extension method is devised to generate discriminative representations by aggregating the ensemble information from multiple pipelines. Furthermore, an Ensemble Nearest Neighbors Mining (ENNM) method is utilized to excavate the most similar samples from the unlabeled data as positive samples, which alleviates the false-negative problem caused by disregarding category labels. Experiments with extensive evaluation protocols show that EnsCLR outperforms previous state-of-the-art methods on the NTU60, NTU120, and PKU-MMD datasets.
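
A minimal sketch of the ensemble nearest-neighbor mining idea is shown below: cosine similarities between query features and a memory queue are averaged across pipelines, and the top match is taken as an extra positive. The number of pipelines, queue size, and function names are illustrative assumptions rather than the EnsCLR implementation.

```python
import torch
import torch.nn.functional as F

def ensemble_nearest_neighbors(queries, queues, k=1):
    """Mine extra positives from memory queues by averaging cosine
    similarity across an ensemble of pipelines.

    queries: list of (B, D) query features, one per pipeline
    queues:  list of (K, D) queue features, one per pipeline
    returns: (B, k) indices of the mined nearest neighbors
    """
    sims = []
    for q, mem in zip(queries, queues):
        q = F.normalize(q, dim=1)
        mem = F.normalize(mem, dim=1)
        sims.append(q @ mem.t())              # (B, K) cosine similarities
    avg_sim = torch.stack(sims, dim=0).mean(dim=0)
    return avg_sim.topk(k, dim=1).indices     # indices into the queue

# Example: two pipelines (e.g., joint and motion streams), queue of 4096 entries.
queries = [torch.randn(32, 128), torch.randn(32, 128)]
queues = [torch.randn(4096, 128), torch.randn(4096, 128)]
pos_idx = ensemble_nearest_neighbors(queries, queues)   # (32, 1)
```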

Citations: 0
Lightning fast video anomaly detection via multi-scale adversarial distillation
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-14 | DOI: 10.1016/j.cviu.2024.104074

We propose a very fast frame-level model for anomaly detection in video, which learns to detect anomalies by distilling knowledge from multiple highly accurate object-level teacher models. To improve the fidelity of our student, we distill the low-resolution anomaly maps of the teachers by jointly applying standard and adversarial distillation, introducing an adversarial discriminator for each teacher to distinguish between target and generated anomaly maps. We conduct experiments on three benchmarks (Avenue, ShanghaiTech, UCSD Ped2), showing that our method is over 7 times faster than the fastest competing method, and between 28 and 62 times faster than object-centric models, while obtaining comparable results to recent methods. Our evaluation also indicates that our model achieves the best trade-off between speed and accuracy, due to its previously unheard-of speed of 1480 FPS. In addition, we carry out a comprehensive ablation study to justify our architectural design choices. Our code is freely available at: https://github.com/ristea/fast-aed.
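
The joint standard-plus-adversarial distillation objective can be sketched as follows for a single teacher: the student regresses the teacher's low-resolution anomaly map with an MSE term, while an adversarial term encourages student maps that a discriminator cannot tell apart from teacher maps. The discriminator architecture and the 0.1 loss weight are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# Minimal sketch of joint standard + adversarial distillation for one teacher.
mse = nn.MSELoss()
bce = nn.BCEWithLogitsLoss()

def distillation_losses(student_map, teacher_map, discriminator):
    """Return (loss for the student, loss for the discriminator)."""
    # Standard distillation: match the teacher's anomaly map.
    l_standard = mse(student_map, teacher_map)
    # Adversarial distillation: push the student to fool the discriminator.
    fake_logits = discriminator(student_map)
    l_adv = bce(fake_logits, torch.ones_like(fake_logits))
    # Discriminator objective (student map detached so gradients stay separate).
    real_logits = discriminator(teacher_map)
    fake_logits_d = discriminator(student_map.detach())
    l_disc = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits_d, torch.zeros_like(fake_logits_d))
    return l_standard + 0.1 * l_adv, l_disc   # 0.1: assumed weighting

disc = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU(),
                     nn.Conv2d(16, 1, 3, 2, 1), nn.Flatten(), nn.LazyLinear(1))
s_map, t_map = torch.rand(4, 1, 32, 32), torch.rand(4, 1, 32, 32)
loss_student, loss_disc = distillation_losses(s_map, t_map, disc)
```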

Citations: 0
Low-light image enhancement based on cell vibration energy model and lightness difference
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-14 | DOI: 10.1016/j.cviu.2024.104079

Low-light image enhancement algorithms play a crucial role in revealing details obscured by darkness in images and substantially improving overall image quality. However, existing methods often suffer from issues like color or lightness distortion and possess limited scalability. In response to these challenges, we introduce a novel low-light image enhancement algorithm leveraging a cell vibration energy model and lightness difference. Initially, a new low-light image enhancement framework is proposed, building upon a comprehensive understanding and analysis of the cell vibration energy model and its statistical properties. Subsequently, to achieve pixel-level multi-lightness difference adjustment and exert control over the lightness level of each pixel independently, a lightness difference adjustment strategy is introduced utilizing Weibull distribution and linear mapping. Furthermore, to expand the adaptive range of the algorithm, we consider the disparities between HSV space and RGB space. Two enhanced image output modes are designed, accompanied by a thorough analysis and deduction of the relevant image layer mapping formulas. Finally, to enhance the reliability of experimental results, certain image faults in the SICE database are rectified using the feature matching method. Experimental results showcase the superiority of the proposed algorithm over twelve state-of-the-art algorithms. The resource code of this article will be released at https://github.com/leixiaozhou/CDEGmethod.
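
To illustrate the kind of pixel-wise, Weibull-shaped lightness adjustment the abstract refers to, the sketch below boosts the HSV value channel with a Weibull-CDF tone curve. The curve parameters, the never-darken rule, and the use of OpenCV color conversion are assumptions for illustration only, not the paper's cell-vibration formulation.

```python
import cv2
import numpy as np

def weibull_lightness_boost(bgr, k=1.2, lam=0.55):
    """Boost lightness in HSV space with a Weibull-CDF-shaped tone curve.
    k (shape) and lam (scale) are illustrative values, not tuned constants."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    v = hsv[..., 2] / 255.0                              # lightness in [0, 1]
    gain = 1.0 - np.exp(-(v / lam) ** k)                 # Weibull CDF as tone curve
    v_new = np.clip(np.maximum(v, gain), 0.0, 1.0)       # never darken a pixel
    hsv[..., 2] = v_new * 255.0
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

dark = (np.random.rand(64, 64, 3) * 60).astype(np.uint8)   # synthetic dark image
bright = weibull_lightness_boost(dark)
```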

Citations: 0
Multi-domain awareness for compressed deepfake videos detection over social networks guided by common mechanisms between artifacts
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-10 | DOI: 10.1016/j.cviu.2024.104072

The viral spread of massive deepfake videos over social networks has caused serious security problems. Despite the remarkable advancements achieved by existing deepfake detection algorithms, deepfake videos over social networks are inevitably influenced by compression factors. This causes deepfake detection performance to be limited by the following challenging issues: (a) interfering with compression artifacts, (b) loss of feature information, and (c) aliasing of feature distributions. In this paper, we analyze the common mechanism between compression artifacts and deepfake artifacts, revealing the structural similarity between them and providing a reliable theoretical basis for enhancing the robustness of deepfake detection models against compression. Firstly, based on the common mechanism between artifacts, we design a frequency domain adaptive notch filter to eliminate the interference of compression artifacts on specific frequency bands. Secondly, to reduce the sensitivity of deepfake detection models to unknown noise, we propose a spatial residual denoising strategy. Thirdly, to exploit the intrinsic correlation between feature vectors in the frequency domain branch and the spatial domain branch, we enhance deepfake features using an attention-based feature fusion method. Finally, we adopt a multi-task decision approach to enhance the discriminative power of the latent space representation of deepfakes, achieving deepfake detection with robustness against compression. Extensive experiments show that compared with the baseline methods, the detection performance of the proposed algorithm on compressed deepfake videos has been significantly improved. In particular, our model is resistant to various types of noise disturbances and can be easily combined with baseline detection models to improve their robustness.
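
The spatial residual denoising idea can be sketched very simply: subtract a smoothed version of each frame from the frame itself, so high-frequency noise and manipulation traces dominate what remains. A depthwise Gaussian blur stands in for the denoiser here; the kernel size and sigma are assumptions, not the paper's strategy.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel2d(size=5, sigma=1.0):
    """Separable Gaussian kernel used as a stand-in denoiser."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g)

def spatial_residual(x, size=5, sigma=1.0):
    """Residual = frame minus its smoothed version; high-frequency noise
    and manipulation traces tend to survive in this residual."""
    c = x.shape[1]
    k = gaussian_kernel2d(size, sigma).expand(c, 1, size, size)
    blurred = F.conv2d(x, k, padding=size // 2, groups=c)   # depthwise blur
    return x - blurred

frames = torch.rand(2, 3, 128, 128)       # stand-in video frames
residual = spatial_residual(frames)       # same shape, noise-like content
```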

Citations: 0
Vision and Structured-Language Pretraining for Cross-Modal Food Retrieval
IF 4.3 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-09 | DOI: 10.1016/j.cviu.2024.104071

Vision-Language Pretraining (VLP) and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks. However, leveraging these powerful techniques for more complex vision-language tasks with more structured input data, such as cooking applications, is still little investigated. In this work, we propose to leverage these techniques for structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook, first transforms existing image-text pairs to image and structured-text pairs. This allows us to pretrain our VLPCook model using VLP objectives adapted to the structured data of the resulting datasets, and then finetune it on downstream computational cooking tasks. During finetuning, we also enrich the visual encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local and global textual context. VLPCook outperforms current SoTA by a significant margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food Retrieval on the large Recipe1M dataset. We conduct further experiments on VLP to validate its importance, especially on the Recipe1M+ dataset. Finally, we validate the generalization of the approach to other tasks (i.e., Food Recognition) and domains with structured text, such as the Medical domain on the ROCO dataset. The code will be made publicly available.
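
A toy sketch of the image-text to image-structured-text transformation is given below: a flat caption is parsed into title/ingredients/instructions fields and then flattened back into tagged text for a standard text encoder. The field names, tags, and the caption format assumed by the parser are illustrative assumptions, not VLPCook's actual preprocessing.

```python
# Minimal sketch (field names and tags are illustrative assumptions) of turning a
# flat image-text pair into an image + structured-text pair, then flattening the
# structure back into tagged text for a text encoder.
from dataclasses import dataclass
from typing import List

@dataclass
class StructuredRecipe:
    title: str
    ingredients: List[str]
    instructions: List[str]

def to_structured(caption: str) -> StructuredRecipe:
    # Toy parser: assumes "title | ing1; ing2 | step1. step2." formatting.
    title, ingredients, instructions = (part.strip() for part in caption.split('|'))
    return StructuredRecipe(
        title=title,
        ingredients=[i.strip() for i in ingredients.split(';') if i.strip()],
        instructions=[s.strip() for s in instructions.split('.') if s.strip()],
    )

def flatten(recipe: StructuredRecipe) -> str:
    # Tag each field so the text encoder can tell the parts apart.
    return ' '.join(
        [f'[TITLE] {recipe.title}']
        + [f'[ING] {i}' for i in recipe.ingredients]
        + [f'[INS] {s}' for s in recipe.instructions]
    )

pair = ('pizza.jpg', 'Margherita pizza | dough; tomato; mozzarella | Bake at 250C. Serve hot.')
structured = (pair[0], to_structured(pair[1]))
print(flatten(structured[1]))
```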

Citations: 0