
Latest Publications from the International Journal of Computer Vision

AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-23 | DOI: 10.1007/s11263-024-02309-y
Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen

Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring the users to provide per-image control conditions such as sketches. However, these simplifications render these methods incompetent for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images, with minimal human interactions. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches, and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module to transform simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves the image quality but also allows easy and intuitive user interactions. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images. This allows our method to obtain consistent story visualization even when only texts are provided as input. Both qualitative and quantitative experiments demonstrate the superiority of our method.
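
The pipeline described above (LLM layout planning, sparse-to-dense condition generation, and layout-conditioned text-to-image synthesis) can be summarized in a short sketch. Everything below is a hypothetical stand-in written for illustration: `plan_layout`, `dense_generator`, and `t2i_model` are placeholder names, and the stub objects only show how the stages would be wired, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Panel:
    caption: str                    # per-panel text derived from the story
    boxes: list                     # sparse layout: (character, x0, y0, x1, y1)
    dense_condition: object = None  # sketch/keypoint map lifted from the boxes
    image: object = None

def visualize_story(story_text, llm, dense_generator, t2i_model):
    # 1) The LLM plans per-panel captions and bounding-box layouts.
    panels = [Panel(caption=c, boxes=b) for c, b in llm.plan_layout(story_text)]
    for p in panels:
        # 2) Sparse boxes are lifted to dense control signals (sketch or keypoints).
        p.dense_condition = dense_generator(p.boxes)
        # 3) A condition-guided text-to-image model renders the panel; reusing the
        #    same character references keeps identities consistent across panels.
        p.image = t2i_model(p.caption, p.dense_condition)
    return panels

# Toy stand-ins so the sketch runs end to end.
class StubLLM:
    def plan_layout(self, text):
        return [("a cat naps in a sunny garden", [("cat", 0.2, 0.4, 0.7, 0.9)])]

panels = visualize_story(
    "A lazy cat's day.", StubLLM(),
    dense_generator=lambda boxes: {"sketch_from": boxes},
    t2i_model=lambda caption, cond: f"<image: {caption}>")
print(panels[0].image)
```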

Citations: 0
Noise-Resistant Multimodal Transformer for Emotion Recognition
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-22 | DOI: 10.1007/s11263-024-02304-3
Yuanyuan Liu, Haoyu Zhang, Yibing Zhan, Zijing Chen, Guanghao Yin, Lin Wei, Zhe Chen

Multimodal emotion recognition identifies human emotions from various data modalities like video, text, and audio. However, we found that this task can be easily affected by noisy information that does not contain useful semantics and may occur at different locations of a multimodal input sequence. To this end, we present a novel paradigm that attempts to extract noise-resistant features in its pipeline and introduces a noise-aware learning scheme to effectively improve the robustness of multimodal emotion understanding against noisy information. Our new pipeline, namely Noise-Resistant Multimodal Transformer (NORM-TR), mainly introduces a Noise-Resistant Generic Feature (NRGF) extractor and a multimodal fusion Transformer for the multimodal emotion recognition task. In particular, we make the NRGF extractor learn to provide a generic and disturbance-insensitive representation so that consistent and meaningful semantics can be obtained. Furthermore, we apply a multimodal fusion Transformer to incorporate Multimodal Features (MFs) of multimodal inputs (serving as the key and value) based on their relations to the NRGF (serving as the query). Therefore, the possible insensitive but useful information of NRGF could be complemented by MFs that contain more details, achieving more accurate emotion understanding while maintaining robustness against noises. To train the NORM-TR properly, our proposed noise-aware learning scheme complements normal emotion recognition losses by enhancing the learning against noises. Our learning scheme explicitly adds noises to either all the modalities or a specific modality at random locations of a multimodal input sequence. We correspondingly introduce two adversarial losses to encourage the NRGF extractor to learn to extract the NRGFs invariant to the added noises, thus facilitating the NORM-TR to achieve more favorable multimodal emotion recognition performance. In practice, extensive experiments can demonstrate the effectiveness of the NORM-TR and the noise-aware learning scheme for dealing with both explicitly added noisy information and the normal multimodal sequence with implicit noises. On several popular multimodal datasets (e.g., MOSI, MOSEI, IEMOCAP, and RML), our NORM-TR achieves state-of-the-art performance and outperforms existing methods by a large margin, which demonstrates that the ability to resist noisy information in multimodal input is important for effective emotion recognition.
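
The fusion step described above, where the noise-resistant generic features (NRGF) serve as the query and the multimodal features (MFs) serve as the key and value, is essentially a cross-attention. The PyTorch sketch below illustrates that step only; the dimensions, residual connection, and layer norm are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class NRGFFusion(nn.Module):
    """Cross-attention sketch: NRGF tokens query the detailed multimodal features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nrgf, mfs):
        # nrgf: (B, Tq, D) noise-resistant generic features (query)
        # mfs:  (B, Tk, D) concatenated video/text/audio features (key and value)
        fused, _ = self.attn(query=nrgf, key=mfs, value=mfs)
        return self.norm(nrgf + fused)   # robust features complemented with details

nrgf = torch.randn(2, 8, 256)
mfs = torch.randn(2, 3 * 32, 256)        # e.g. 32 tokens from each of three modalities
out = NRGFFusion()(nrgf, mfs)            # (2, 8, 256) fused emotion representation
```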

Citations: 0
Polynomial Implicit Neural Framework for Promoting Shape Awareness in Generative Models
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-20 | DOI: 10.1007/s11263-024-02270-w
Utkarsh Nath, Rajhans Singh, Ankita Shukla, Kuldeep Kulkarni, Pavan Turaga

Polynomial functions have been employed to represent shape-related information in 2D and 3D computer vision, even from the very early days of the field. In this paper, we present a framework using polynomial-type basis functions to promote shape awareness in contemporary generative architectures. The benefits of using a learnable form of polynomial basis functions as drop-in modules into generative architectures are several—including promoting shape awareness, a noticeable disentanglement of shape from texture, and high quality generation. To enable the architectures to have a small number of parameters, we further use implicit neural representations (INR) as the base architecture. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model’s representational power. Higher representational power is critically needed to transition from representing a single given image to effectively representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminates the need for positional encodings. Therefore, to achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets such as ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with significantly fewer trainable parameters. With substantially fewer training parameters and higher representative power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. The code is publicly available at https://github.com/Rajhans0/Poly_INR.
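
The degree-raising operation the abstract spells out, an element-wise product between hidden features and affine-transformed coordinates after every ReLU, is easy to sketch. The layer sizes, depth, and output head below are placeholder choices, not the paper's configuration.

```python
import torch
import torch.nn as nn

class PolyBlock(nn.Module):
    """Linear + ReLU, then an element-wise product with an affine transform of the
    input coordinates, which raises the polynomial degree of the representation."""
    def __init__(self, coord_dim, hidden_dim):
        super().__init__()
        self.feature = nn.Linear(hidden_dim, hidden_dim)
        self.coord_affine = nn.Linear(coord_dim, hidden_dim)

    def forward(self, h, coords):
        return torch.relu(self.feature(h)) * self.coord_affine(coords)

class PolyINR(nn.Module):
    def __init__(self, coord_dim=2, hidden_dim=64, out_dim=3, num_blocks=4):
        super().__init__()
        self.stem = nn.Linear(coord_dim, hidden_dim)
        self.blocks = nn.ModuleList(PolyBlock(coord_dim, hidden_dim) for _ in range(num_blocks))
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, coords):              # coords: (N, 2) pixel locations in [-1, 1]
        h = self.stem(coords)
        for blk in self.blocks:
            h = blk(h, coords)
        return self.head(h)                 # RGB per coordinate, no positional encoding

rgb = PolyINR()(torch.rand(1024, 2) * 2 - 1)   # (1024, 3)
```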

Citations: 0
Deep Attention Learning for Pre-operative Lymph Node Metastasis Prediction in Pancreatic Cancer via Multi-object Relationship Modeling
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-20 | DOI: 10.1007/s11263-024-02314-1
Zhilin Zheng, Xu Fang, Jiawen Yao, Mengmeng Zhu, Le Lu, Yu Shi, Hong Lu, Jianping Lu, Ling Zhang, Chengwei Shao, Yun Bian

Lymph node (LN) metastasis status is one of the most critical prognostic and cancer staging clinical factors for patients with resectable pancreatic ductal adenocarcinoma (PDAC, generally for any types of solid malignant tumors). Pre-operative prediction of LN metastasis from non-invasive CT imaging is highly desired, as it might be directly and conveniently used to guide the follow-up neoadjuvant treatment decision and surgical planning. Most previous studies only use the tumor characteristics in CT imaging alone to implicitly infer LN metastasis. To the best of our knowledge, this is the first work to propose a fully-automated LN segmentation and identification network to directly facilitate the LN metastasis status prediction task for patients with PDAC. Specially, (1) we explore the anatomical spatial context priors of pancreatic LN locations by generating a guiding attention map from related organs and vessels to assist segmentation and infer LN status. As such, LN segmentation is impelled to focus on regions that are anatomically adjacent or plausible with respect to the specific organs and vessels. (2) The metastasized LN identification network is trained to classify the segmented LN instances into positives or negatives by reusing the segmentation network as a pre-trained backbone and padding a new classification head. (3) Importantly, we develop a LN metastasis status prediction network that combines and aggregates the holistic patient-wise diagnosis information of both LN segmentation/identification and deep imaging characteristics by the PDAC tumor region. Extensive quantitative nested five-fold cross-validation is conducted on a discovery dataset of 749 patients with PDAC. External multi-center clinical evaluation is further performed on two other hospitals of 191 total patients. Our multi-staged LN metastasis status prediction network statistically significantly outperforms strong baselines of nnUNet and several other compared methods, including CT-reported LN status, radiomics, and deep learning models.
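
Step (2) above, reusing the trained segmentation network as a pre-trained backbone and attaching a new classification head over each segmented LN instance, can be illustrated with a minimal PyTorch sketch; the toy backbone, pooling head, and frozen-encoder choice are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LNIdentifier(nn.Module):
    """Reuse a segmentation encoder as a frozen backbone and attach a small head
    that labels each segmented lymph-node crop as metastasis-positive or negative."""
    def __init__(self, seg_backbone, feat_channels):
        super().__init__()
        self.backbone = seg_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # pre-trained weights kept fixed (assumption)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels, 2))     # positive vs. negative LN

    def forward(self, ln_crop):
        feats = self.backbone(ln_crop)       # (B, feat_channels, h, w)
        return self.head(feats)

toy_backbone = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 64, 3, padding=1))
logits = LNIdentifier(toy_backbone, 64)(torch.randn(4, 1, 96, 96))   # (4, 2)
```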

Citations: 0
Learning Discriminative Features for Visual Tracking via Scenario Decoupling
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-19 | DOI: 10.1007/s11263-024-02307-0
Yinchao Ma, Qianjin Yu, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang

Visual tracking aims to estimate object state automatically in a video sequence, which is challenging especially in complex scenarios. Recent Transformer-based trackers enable the interaction between the target template and search region in the feature extraction phase for target-aware feature learning, which have achieved superior performance. However, visual tracking is essentially a task to discriminate the specified target from the backgrounds. These trackers commonly ignore the role of background in feature learning, which may cause backgrounds to be mistakenly enhanced in complex scenarios, affecting temporal robustness and spatial discriminability. To address the above limitations, we propose a scenario-aware tracker (SATrack) based on a specifically designed scenario-aware Vision Transformer, which integrates a scenario knowledge extractor and a scenario knowledge modulator. The proposed SATrack enjoys several merits. Firstly, we design a novel scenario-aware Vision Transformer for visual tracking, which can decouple historic scenarios into explicit target and background knowledge to guide discriminative feature learning. Secondly, a scenario knowledge extractor is designed to dynamically acquire decoupled and compact scenario knowledge from video contexts, and a scenario knowledge modulator is designed to embed scenario knowledge into attention mechanisms for scenario-aware feature learning. Extensive experimental results on nine tracking benchmarks demonstrate that SATrack achieves new state-of-the-art performance with high FPS.
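
One way to picture the scenario knowledge modulator is as extra target and background tokens appended to the keys and values of the search-region attention, as in the hedged sketch below; SATrack's actual extractor and modulation scheme are more involved and are not reproduced here.

```python
import torch
import torch.nn as nn

class ScenarioModulatedAttention(nn.Module):
    """Attention over search-region tokens, with compact target/background
    knowledge tokens joining the keys and values (illustrative only)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, search_tokens, target_knowledge, background_knowledge):
        # search_tokens: (B, N, D); knowledge tokens: (B, Kt, D) and (B, Kb, D)
        scenario = torch.cat([target_knowledge, background_knowledge], dim=1)
        kv = torch.cat([search_tokens, scenario], dim=1)
        out, _ = self.attn(query=search_tokens, key=kv, value=kv)
        return search_tokens + out          # target-aware, background-suppressed features

x = ScenarioModulatedAttention()(torch.randn(2, 256, 256),   # search-region tokens
                                 torch.randn(2, 4, 256),     # target knowledge
                                 torch.randn(2, 4, 256))     # background knowledge
```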

Citations: 0
Hard-Normal Example-Aware Template Mutual Matching for Industrial Anomaly Detection
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-18 | DOI: 10.1007/s11263-024-02323-0
Zixuan Chen, Xiaohua Xie, Lingxiao Yang, Jian-Huang Lai

Anomaly detectors are widely used in industrial manufacturing to detect and localize unknown defects in query images. These detectors are trained on anomaly-free samples and have successfully distinguished anomalies from most normal samples. However, hard-normal examples are scattered and far apart from most normal samples, and thus they are often mistaken for anomalies by existing methods. To address this issue, we propose Hard-normal Example-aware Template Mutual Matching (HETMM), an efficient framework to build a robust prototype-based decision boundary. Specifically, HETMM employs the proposed Affine-invariant Template Mutual Matching (ATMM) to mitigate the affection brought by the affine transformations and easy-normal examples. By mutually matching the pixel-level prototypes within the patch-level search spaces between query and template set, ATMM can accurately distinguish between hard-normal examples and anomalies, achieving low false-positive and missed-detection rates. In addition, we also propose PTS to compress the original template set for speed-up. PTS selects cluster centres and hard-normal examples to preserve the original decision boundary, allowing this tiny set to achieve comparable performance to the original one. Extensive experiments demonstrate that HETMM outperforms state-of-the-art methods, while using a 60-sheet tiny set can achieve competitive performance and real-time inference speed (around 26.1 FPS) on a Quadro 8000 RTX GPU. HETMM is training-free and can be hot-updated by directly inserting novel samples into the template set, which can promptly address some incremental learning issues in industrial manufacturing.
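
To make the prototype-matching idea concrete, here is a hedged sketch of a one-way patch-level template match: every query feature is compared against template features from all sheets within a small spatial window, and the best similarity gives the (inverse) anomaly score. HETMM's affine-invariant mutual (two-way) matching and the PTS template compression are not reproduced.

```python
import torch
import torch.nn.functional as F

def template_matching_anomaly_map(query_feats, template_feats, window=3):
    """query_feats: (C, H, W); template_feats: (T, C, H, W) from T template sheets.
    Returns an (H, W) anomaly map: 1 minus the best cosine similarity found in a
    window x window search space around each location."""
    C, H, W = query_feats.shape
    pad = window // 2
    tpl = F.pad(template_feats, (pad, pad, pad, pad))
    # Collect the window x window neighbourhood at every location, for every sheet.
    patches = F.unfold(tpl, kernel_size=window).reshape(-1, C, window * window, H, W)
    patches = patches.permute(0, 2, 1, 3, 4).reshape(-1, C, H, W)   # (T*w*w, C, H, W)
    q = F.normalize(query_feats, dim=0)
    p = F.normalize(patches, dim=1)
    cos = (q.unsqueeze(0) * p).sum(dim=1)                           # (T*w*w, H, W)
    return 1.0 - cos.max(dim=0).values                              # low similarity => anomalous

anomaly = template_matching_anomaly_map(torch.randn(64, 32, 32), torch.randn(5, 64, 32, 32))
```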

Citations: 0
Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-17 | DOI: 10.1007/s11263-024-02300-7
Mingze Sun, Chao Xu, Xinyu Jiang, Yang Liu, Baigui Sun, Ruqi Huang

In this paper, we introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners. Central to our approach is the incorporation of factorization to decouple audio features and the combination of textual semantic information, thereby facilitating the creation of more realistic and coordinated movements. We separately train VQ-VAEs with respect to the holistic motions of both speaker and listener. We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model specifically designed to characterize real-world communication scenarios effectively which can generate the motions of both the speaker and the listener simultaneously. These designs ensure that the results we generate are both coordinated and diverse. Our approach demonstrates state-of-the-art performance on two benchmark datasets. Furthermore, we introduce the HoCo holistic communication dataset, which is a valuable resource for future research. Our HoCo dataset and code will be released for research purposes upon acceptance.
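
As a rough illustration of the joint auto-regressive idea (not the paper's chain-like architecture), one can interleave the speaker's and listener's VQ-VAE motion codes into a single stream and let a causal transformer predict the next code, so each person's motion is conditioned on the other's history; all sizes below are placeholders.

```python
import torch
import torch.nn as nn

class DyadicMotionAR(nn.Module):
    """Causal transformer over interleaved speaker/listener motion codes."""
    def __init__(self, codebook_size=512, dim=256, layers=4):
        super().__init__()
        # Speaker and listener codes occupy separate index ranges of one embedding table.
        self.embed = nn.Embedding(2 * codebook_size, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(dim, 2 * codebook_size)

    def forward(self, tokens):                       # (B, T): s0, l0, s1, l1, ...
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), mask=causal)
        return self.out(h)                           # logits for the next motion code

logits = DyadicMotionAR()(torch.randint(0, 1024, (2, 16)))   # (2, 16, 1024)
```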

Citations: 0
Hyper-3DG: Text-to-3D Gaussian Generation via Hypergraph
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-16 | DOI: 10.1007/s11263-024-02298-y
Donglin Di, Jiahui Yang, Chaofan Luo, Zhou Xue, Wei Chen, Xun Yang, Yue Gao

Text-to-3D generation represents an exciting field that has seen rapid advancements, facilitating the transformation of textual descriptions into detailed 3D models. However, current progress often neglects the intricate high-order correlation of geometry and texture within 3D objects, leading to challenges such as over-smoothness, over-saturation and the Janus problem. In this work, we propose a method named “3D Gaussian Generation via Hypergraph (Hyper-3DG)”, designed to capture the sophisticated high-order correlations present within 3D objects. Our framework is anchored by a well-established mainflow and an essential module, named “Geometry and Texture Hypergraph Refiner (HGRefiner)”. This module not only refines the representation of 3D Gaussians but also accelerates the update process of these 3D Gaussians by conducting the Patch-3DGS Hypergraph Learning on both explicit attributes and latent visual features. Our framework allows for the production of finely generated 3D objects within a cohesive optimization, effectively circumventing degradation. Extensive experimentation has shown that our proposed method significantly enhances the quality of 3D generation while incurring no additional computational overhead for the underlying framework. (Project code: https://github.com/yjhboy/Hyper3DG).
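
The refiner operates on groups of 3D-Gaussian patches connected by hyperedges. As a hedged illustration of what hypergraph message passing over such patches can look like, the sketch below applies a standard HGNN-style convolution to patch features and a binary incidence matrix; the paper's actual HGRefiner design is not reproduced.

```python
import torch
import torch.nn as nn

class HypergraphRefiner(nn.Module):
    """One HGNN-style smoothing step: nodes are 3D-Gaussian patches, hyperedges
    group patches with shared geometry/texture, and features are mixed through
    the normalized incidence structure before a learned transform."""
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)

    def forward(self, x, incidence):
        # x: (N, D) patch features; incidence: (N, E) binary node-hyperedge matrix
        dv = incidence.sum(1).clamp_min(1).pow(-0.5)    # node degree^{-1/2}
        de = incidence.sum(0).clamp_min(1).pow(-1.0)    # hyperedge degree^{-1}
        msg = (incidence * de) @ (incidence.t() @ (x * dv.unsqueeze(1)))
        return x + torch.relu(self.theta(msg * dv.unsqueeze(1)))

patch_feats = torch.randn(32, 64)                       # 32 Gaussian patches
incidence = (torch.rand(32, 8) > 0.7).float()           # 8 hyperedges (random toy grouping)
refined = HypergraphRefiner(64)(patch_feats, incidence)
```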

Citations: 0
Relation-Guided Adversarial Learning for Data-Free Knowledge Transfer
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-13 | DOI: 10.1007/s11263-024-02303-4
Yingping Liang, Ying Fu

Data-free knowledge distillation transfers knowledge by recovering training data from a pre-trained model. Despite the recent success of seeking global data diversity, the diversity within each class and the similarity among different classes are largely overlooked, resulting in data homogeneity and limited performance. In this paper, we introduce a novel Relation-Guided Adversarial Learning method with triplet losses, which solves the homogeneity problem from two aspects. To be specific, our method aims to promote both intra-class diversity and inter-class confusion of the generated samples. To this end, we design two phases, an image synthesis phase and a student training phase. In the image synthesis phase, we construct an optimization process to push away samples with the same labels and pull close samples with different labels, leading to intra-class diversity and inter-class confusion, respectively. Then, in the student training phase, we perform an opposite optimization, which adversarially attempts to reduce the distance of samples of the same classes and enlarge the distance of samples of different classes. To mitigate the conflict of seeking high global diversity and keeping inter-class confusing, we propose a focal weighted sampling strategy by selecting the negative in the triplets unevenly within a finite range of distance. RGAL shows significant improvement over previous state-of-the-art methods in accuracy and data efficiency. Besides, RGAL can be inserted into state-of-the-art methods on various data-free knowledge transfer applications. Experiments on various benchmarks demonstrate the effectiveness and generalizability of our proposed method on various tasks, specially data-free knowledge distillation, data-free quantization, and non-exemplar incremental learning. Our code will be publicly available to the community.
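
The two opposing objectives described above can be illustrated with a single triplet-style loss whose direction flips between phases: during image synthesis it pushes same-class samples apart and pulls different-class samples together (intra-class diversity, inter-class confusion), while the student phase uses the inverted form. The margin, mining rule, and feature space below are assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def relation_triplet_loss(feats, labels, margin=0.5, invert=False):
    """invert=False: synthesis phase (same class far apart, different classes close).
    invert=True: student phase (same class close, different classes far apart)."""
    d = torch.cdist(feats, feats)
    same = labels[:, None].eq(labels[None, :])
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    pos_mask, neg_mask = same & ~eye, ~same
    total, count = feats.new_zeros(()), 0
    for i in range(len(labels)):
        if pos_mask[i].any() and neg_mask[i].any():
            d_same = d[i][pos_mask[i]].min()        # nearest same-class sample
            d_diff = d[i][neg_mask[i]].min()        # nearest different-class sample
            if invert:
                total = total + F.relu(d_same - d_diff + margin)
            else:
                total = total + F.relu(d_diff - d_same + margin)
            count += 1
    return total / max(count, 1)

feats, labels = torch.randn(16, 128), torch.randint(0, 4, (16,))
synthesis_loss = relation_triplet_loss(feats, labels)               # generator-side objective
student_loss = relation_triplet_loss(feats, labels, invert=True)    # student-side objective
```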

Citations: 0
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
IF 19.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-12 | DOI: 10.1007/s11263-024-02294-2
Yupeng Zhou, Daquan Zhou, Yaxing Wang, Jiashi Feng, Qibin Hou

Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. However, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify that a crucial factor leading to the erroneous generation of objects and their attributes is the inadequate cross-modality relation learning between the prompt and the generated images. To better align the prompt and image content, we advance the cross-attention with an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. This mechanism explicitly diminishes the ambiguity in the semantic information embedding of the text encoder, leading to a boost of text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models. When applied to the latent diffusion models, our MaskDiffusion can largely enhance their capability to correctly generate objects and their attributes, with negligible computation overhead compared to the original diffusion models. Our project page is https://github.com/HVision-NKU/MaskDiffusion.
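
A minimal sketch of the core idea, an adaptive mask over cross-attention conditioned on the current attention map and the prompt embeddings, is given below; the gating network and the renormalization step are assumptions for illustration, not the method's exact formulation.

```python
import torch
import torch.nn as nn

class MaskedCrossAttention(nn.Module):
    """Cross-attention whose text-token contributions are rescaled by a mask
    predicted from the attention map and the prompt embeddings (illustrative)."""
    def __init__(self, dim=320, text_dim=768):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(text_dim, dim, bias=False)
        self.to_v = nn.Linear(text_dim, dim, bias=False)
        self.gate = nn.Linear(text_dim + 1, 1)     # hypothetical mask predictor
        self.scale = dim ** -0.5

    def forward(self, image_tokens, text_tokens):
        q, k, v = self.to_q(image_tokens), self.to_k(text_tokens), self.to_v(text_tokens)
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)   # (B, N_img, N_txt)
        # Condition the mask on each text token's current attention share and its embedding.
        share = attn.mean(dim=1, keepdim=True).transpose(1, 2)            # (B, N_txt, 1)
        mask = torch.sigmoid(self.gate(torch.cat([text_tokens, share], dim=-1)))
        attn = attn * mask.transpose(1, 2)                                # rescale per text token
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return attn @ v                                                   # (B, N_img, dim)

out = MaskedCrossAttention()(torch.randn(2, 64, 320), torch.randn(2, 77, 768))
```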

Citations: 0