International Journal of Computer Vision最新文献_第7页

RepSNet: A Nucleus Instance Segmentation Model Based on Boundary Regression and Structural Re-Parameterization 基于边界回归和结构重参数化的核实例分割模型

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-02 DOI: 10.1007/s11263-024-02332-z

Shengchun Xiong, Xiangru Li, Yunpeng Zhong, Wanfen Peng

Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on some guidances from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parametrizable encoder-decoder structure. This model can not only aggregate features from some receptive fields with various scales which helps segmentation accuracy improvement, but also reduce the parameter amount and computational burdens in the model inference phase through the structural re-parameterization technique. In the experimental comparisons and evaluations on the Lizard dataset, RepSNet demonstrated superior segmentation accuracy and inference speed compared to several typical benchmark models. The experimental code, dataset splitting configuration and the pre-trained model were released at https://github.com/luckyrz0/RepSNet.

病理诊断是肿瘤诊断的金标准，而核实例分割是数字病理分析和病理诊断的关键步骤。然而，模型的计算效率和重叠目标的处理是该问题研究的主要挑战。为此，基于核边界回归和结构重参数化方案设计神经网络模型RepSNet，对H&； e染色组织病理图像中的核进行分割和分类。首先，RepSNet对每个像素估计母核的边界位置信息（BPI）。BPI估计结合了像素的局部信息和母核的上下文信息。然后，利用提出的边界投票机制（BVM）对一系列像素点的bpi进行聚合，估计出核边界，并利用连通分量分析程序从估计的核边界计算出实例分割结果。BVM本质上实现了不同像素点的bpi之间的一种协同信念增强。因此，与文献中基于直接像素识别方案获得核边界的方法不同，RepSNet采用集成机制，基于宏观信息的一些指导来计算其边界决策。此外，RepSNet采用可重新参数化的编码器-解码器结构。该模型不仅可以对不同尺度的接收场特征进行聚合，提高分割精度，还可以通过结构重参数化技术减少模型推理阶段的参数数量和计算量。在蜥蜴数据集的实验比较和评估中，与几个典型的基准模型相比，RepSNet显示出更高的分割精度和推理速度。实验代码、数据集分割配置和预训练模型在https://github.com/luckyrz0/RepSNet上发布。

{"title":"RepSNet: A Nucleus Instance Segmentation Model Based on Boundary Regression and Structural Re-Parameterization","authors":"Shengchun Xiong, Xiangru Li, Yunpeng Zhong, Wanfen Peng","doi":"10.1007/s11263-024-02332-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02332-z","url":null,"abstract":"Pathological diagnosis is the gold standard for tumor diagnosis, and nucleus instance segmentation is a key step in digital pathology analysis and pathological diagnosis. However, the computational efficiency of the model and the treatment of overlapping targets are the major challenges in the studies of this problem. To this end, a neural network model RepSNet was designed based on a nucleus boundary regression and a structural re-parameterization scheme for segmenting and classifying the nuclei in H&E-stained histopathological images. First, RepSNet estimates the boundary position information (BPI) of the parent nucleus for each pixel. The BPI estimation incorporates the local information of the pixel and the contextual information of the parent nucleus. Then, the nucleus boundary is estimated by aggregating the BPIs from a series of pixels using a proposed boundary voting mechanism (BVM), and the instance segmentation results are computed from the estimated nucleus boundary using a connected component analysis procedure. The BVM intrinsically achieves a kind of synergistic belief enhancement among the BPIs from various pixels. Therefore, different from the methods available in literature that obtain nucleus boundaries based on a direct pixel recognition scheme, RepSNet computes its boundary decisions based on some guidances from macroscopic information using an integration mechanism. In addition, RepSNet employs a re-parametrizable encoder-decoder structure. This model can not only aggregate features from some receptive fields with various scales which helps segmentation accuracy improvement, but also reduce the parameter amount and computational burdens in the model inference phase through the structural re-parameterization technique. In the experimental comparisons and evaluations on the Lizard dataset, RepSNet demonstrated superior segmentation accuracy and inference speed compared to several typical benchmark models. The experimental code, dataset splitting configuration and the pre-trained model were released at https://github.com/luckyrz0/RepSNet.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142917313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Pseudo-Plane Regularized Signed Distance Field for Neural Indoor Scene Reconstruction 神经室内场景重建的伪平面正则化签名距离场

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-31 DOI: 10.1007/s11263-024-02319-w

Jing Li, Jinpeng Yu, Ruoyu Wang, Shenghua Gao

Given only a set of images, neural implicit surface representation has shown its capability in 3D surface reconstruction. However, as the nature of per-scene optimization is based on the volumetric rendering of color, previous neural implicit surface reconstruction methods usually fail in the low-textured regions, including floors, walls, etc., which commonly exist for indoor scenes. Being aware of the fact that these low-textured regions usually correspond to planes, without introducing additional ground-truth supervisory signals or making additional assumptions about the room layout, we propose to leverage a novel Pseudo-plane regularized Signed Distance Field (PPlaneSDF) for indoor scene reconstruction. Specifically, we consider adjacent pixels with similar colors to be on the same pseudo-planes. The plane parameters are then estimated on the fly during training by an efficient and effective two-step scheme. Then the signed distances of the points on the planes are regularized by the estimated plane parameters in the training phase. As the unsupervised plane segments are usually noisy and inaccurate, we propose to assign different weights to the sampled points on the plane in plane estimation as well as the regularization loss. The weights come by fusing the plane segments from different views. As the sampled rays in the planar regions are redundant, leading to inefficient training, we further propose a keypoint-guided rays sampling strategy that attends to the informative textured regions with large color variations, and the implicit network gets a better reconstruction, compared with the original uniform ray sampling strategy. Experiments show that our PPlaneSDF achieves competitive reconstruction performance in Manhattan scenes. Further, as we do not introduce any additional room layout assumption, our PPlaneSDF generalizes well to the reconstruction of non-Manhattan scenes.

仅在给定一组图像的情况下，神经隐式表面表示在三维表面重建中显示出了它的能力。然而，由于逐场景优化的本质是基于颜色的体积渲染，以往的神经隐式表面重建方法通常在低纹理区域失败，包括地板、墙壁等，这些区域通常存在于室内场景中。意识到这些低纹理区域通常对应于平面，而不引入额外的地面真值监控信号或对房间布局做出额外的假设，我们建议利用一种新的伪平面正则化签名距离场（PPlaneSDF）进行室内场景重建。具体来说，我们认为具有相似颜色的相邻像素位于相同的伪平面上。然后在训练过程中，通过一种高效的两步法对平面参数进行动态估计。然后在训练阶段用估计的平面参数正则化平面上点的带符号距离。由于无监督平面段通常存在噪声和不准确的问题，我们提出在平面估计和正则化损失中对平面上的采样点赋予不同的权重。权重是通过融合来自不同视图的平面段来获得的。针对平面区域的采样光线冗余导致训练效率低下的问题，我们进一步提出了一种关键点引导的光线采样策略，该策略关注颜色变化较大的信息纹理区域，与原始的均匀射线采样策略相比，隐式网络得到了更好的重建。实验表明，我们的PPlaneSDF在曼哈顿场景中取得了具有竞争力的重建性能。此外，由于我们没有引入任何额外的房间布局假设，我们的PPlaneSDF很好地概括了非曼哈顿场景的重建。

{"title":"Pseudo-Plane Regularized Signed Distance Field for Neural Indoor Scene Reconstruction","authors":"Jing Li, Jinpeng Yu, Ruoyu Wang, Shenghua Gao","doi":"10.1007/s11263-024-02319-w","DOIUrl":"https://doi.org/10.1007/s11263-024-02319-w","url":null,"abstract":"Given only a set of images, neural implicit surface representation has shown its capability in 3D surface reconstruction. However, as the nature of per-scene optimization is based on the volumetric rendering of color, previous neural implicit surface reconstruction methods usually fail in the low-textured regions, including floors, walls, etc., which commonly exist for indoor scenes. Being aware of the fact that these low-textured regions usually correspond to planes, without introducing additional ground-truth supervisory signals or making additional assumptions about the room layout, we propose to leverage a novel Pseudo-plane regularized Signed Distance Field (PPlaneSDF) for indoor scene reconstruction. Specifically, we consider adjacent pixels with similar colors to be on the same pseudo-planes. The plane parameters are then estimated on the fly during training by an efficient and effective two-step scheme. Then the signed distances of the points on the planes are regularized by the estimated plane parameters in the training phase. As the unsupervised plane segments are usually noisy and inaccurate, we propose to assign different weights to the sampled points on the plane in plane estimation as well as the regularization loss. The weights come by fusing the plane segments from different views. As the sampled rays in the planar regions are redundant, leading to inefficient training, we further propose a keypoint-guided rays sampling strategy that attends to the informative textured regions with large color variations, and the implicit network gets a better reconstruction, compared with the original uniform ray sampling strategy. Experiments show that our PPlaneSDF achieves competitive reconstruction performance in Manhattan scenes. Further, as we do not introduce any additional room layout assumption, our PPlaneSDF generalizes well to the reconstruction of non-Manhattan scenes.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142905137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

CSFRNet: Integrating Clothing Status Awareness for Long-Term Person Re-identification CSFRNet：整合服装身份意识，实现人的长期再认同

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-30 DOI: 10.1007/s11263-024-02315-0

Yan Huang, Yan Huang, Zhang Zhang, Qiang Wu, Yi Zhong, Liang Wang

Addressing the dynamic nature of long-term person re-identification (LT-reID) amid varying clothing conditions necessitates a departure from conventional methods. Traditional LT-reID strategies, mainly biometrics-based and data adaptation-based, each have their pitfalls. The former falters in environments lacking high-quality biometric data, while the latter loses efficacy with minimal or subtle clothing changes. To overcome these obstacles, we propose the clothing status-aware feature regularization network (CSFRNet). This novel approach seamlessly incorporates clothing status awareness into the feature learning process, significantly enhancing the adaptability and accuracy of LT-reID systems where clothing can either change completely, partially, or not at all over time, without the need for explicit clothing labels. The versatility of our CSFRNet is showcased on diverse LT-reID benchmarks, including Celeb-reID, Celeb-reID-light, PRCC, DeepChange, and LTCC, marking a significant advancement in the field by addressing the real-world variability of clothing in LT-reID scenarios.

在不同的服装条件下解决长期人员重新识别（LT-reID）的动态性需要与传统方法不同。传统的LT-reID策略，主要是基于生物特征和数据适应，都有其缺陷。前者在缺乏高质量生物特征数据的环境中表现不佳，而后者在最小或细微的服装变化中失去功效。为了克服这些障碍，我们提出了服装状态感知特征正则化网络（CSFRNet）。这种新颖的方法将服装状态感知无缝地整合到特征学习过程中，显著提高了LT-reID系统的适应性和准确性，在这种系统中，服装可以随着时间的推移完全、部分或根本不改变，而不需要明确的服装标签。我们的CSFRNet的多功能性在各种LT-reID基准测试中得到了展示，包括Celeb-reID、Celeb-reID-light、PRCC、DeepChange和LTCC，通过解决LT-reID场景中服装的真实可变性，标志着该领域的重大进步。

{"title":"CSFRNet: Integrating Clothing Status Awareness for Long-Term Person Re-identification","authors":"Yan Huang, Yan Huang, Zhang Zhang, Qiang Wu, Yi Zhong, Liang Wang","doi":"10.1007/s11263-024-02315-0","DOIUrl":"https://doi.org/10.1007/s11263-024-02315-0","url":null,"abstract":"Addressing the dynamic nature of long-term person re-identification (LT-reID) amid varying clothing conditions necessitates a departure from conventional methods. Traditional LT-reID strategies, mainly biometrics-based and data adaptation-based, each have their pitfalls. The former falters in environments lacking high-quality biometric data, while the latter loses efficacy with minimal or subtle clothing changes. To overcome these obstacles, we propose the clothing status-aware feature regularization network (CSFRNet). This novel approach seamlessly incorporates clothing status awareness into the feature learning process, significantly enhancing the adaptability and accuracy of LT-reID systems where clothing can either change completely, partially, or not at all over time, without the need for explicit clothing labels. The versatility of our CSFRNet is showcased on diverse LT-reID benchmarks, including Celeb-reID, Celeb-reID-light, PRCC, DeepChange, and LTCC, marking a significant advancement in the field by addressing the real-world variability of clothing in LT-reID scenarios.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"48 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142901709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

AniClipart: Clipart Animation with Text-to-Video Priors AniClipart：剪贴画动画与文本到视频的先验

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-27 DOI: 10.1007/s11263-024-02306-1

Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao

Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.

剪贴画是一种预先制作好的图形艺术形式，它提供了一种方便有效的方式来说明视觉内容。将静态剪贴画图像转换为运动序列的传统工作流程既费力又耗时，涉及许多复杂的步骤，如索具，关键动画和中间。最近在文本到视频生成方面取得的进展在解决这一问题方面具有很大的潜力。然而，直接应用文本到视频生成模型往往难以保持剪贴画图像的视觉识别或生成卡通风格的运动，导致动画效果不理想。在本文中，我们介绍了AniClipart，一个将静态剪贴画图像转换为高质量运动序列的系统，该系统由文本到视频先验引导。为了生成卡通风格的平滑运动，我们首先在剪贴画图像的关键点上定义bsamizier曲线，作为运动正则化的一种形式。然后，我们通过优化视频分数蒸馏采样（VSDS）损失，将关键点的运动轨迹与提供的文本提示对齐，该损失在预训练的文本到视频扩散模型中编码了足够的自然运动知识。采用可微的As-Rigid-As-Possible形状变形算法，可以在保持变形刚度的情况下实现端到端优化。实验结果表明，所提出的AniClipart在文本-视频对齐、视觉身份保持和运动一致性方面始终优于现有的图像-视频生成模型。此外，我们还展示了AniClipart的多功能性，通过调整它来生成更广泛的动画格式，例如分层动画，它允许拓扑更改。

{"title":"AniClipart: Clipart Animation with Text-to-Video Priors","authors":"Ronghuan Wu, Wanchao Su, Kede Ma, Jing Liao","doi":"10.1007/s11263-024-02306-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02306-1","url":null,"abstract":"Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Combating Label Noise with a General Surrogate Model for Sample Selection 用通用代理模型对抗标签噪声进行样本选择

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-27 DOI: 10.1007/s11263-024-02324-z

Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang

Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small loss criterion where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, freeing from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples with its ability of text-image alignment. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without CLIP involved during the inference stage.

现代深度学习系统需要大量数据。利用网络数据进行学习是一种可行的解决方案，但不可避免地会引入标签噪声，从而影响深度神经网络的性能。样本选择是处理标签噪声的有效方法。关键是根据一些标准分离干净的样品。以往的方法更注重小损耗准则，将小损耗样品视为干净样品。然而，这种策略依赖于每个数据实例的学习动态。由于经常出现的错误学习模式，一些有噪声的样本仍然被记忆。为了解决这个问题，不需要训练的代理模型是首选，它可以摆脱记忆的影响。在这项工作中，我们提出利用视觉语言代理模型CLIP自动过滤噪声样本。CLIP利用文本-图像对齐的能力，引入外部知识，方便干净样本的选择。此外，设计了一个边际自适应损失来正则化CLIP引入的选择偏差，提供对标记噪声的鲁棒性。我们在真实世界和合成噪声数据集上验证了我们提出的方法的有效性。我们的方法在推理阶段不涉及CLIP的情况下取得了显著的改进。

{"title":"Combating Label Noise with a General Surrogate Model for Sample Selection","authors":"Chao Liang, Linchao Zhu, Humphrey Shi, Yi Yang","doi":"10.1007/s11263-024-02324-z","DOIUrl":"https://doi.org/10.1007/s11263-024-02324-z","url":null,"abstract":"Modern deep learning systems are data-hungry. Learning with web data is one of the feasible solutions, but will introduce label noise inevitably, which can hinder the performance of deep neural networks. Sample selection is an effective way to deal with label noise. The key is to separate clean samples based on some criterion. Previous methods pay more attention to the small loss criterion where small-loss samples are regarded as clean ones. Nevertheless, such a strategy relies on the learning dynamics of each data instance. Some noisy samples are still memorized due to frequently occurring corrupted learning patterns. To tackle this problem, a training-free surrogate model is preferred, freeing from the effect of memorization. In this work, we propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. CLIP brings external knowledge to facilitate the selection of clean samples with its ability of text-image alignment. Furthermore, a margin adaptive loss is designed to regularize the selection bias introduced by CLIP, providing robustness to label noise. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets. Our method achieves significant improvement without CLIP involved during the inference stage.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"23 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Exploring Homogeneous and Heterogeneous Consistent Label Associations for Unsupervised Visible-Infrared Person ReID 探索无监督可见红外人ReID的同质和异质一致标签关联

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-26 DOI: 10.1007/s11263-024-02322-1

Lingfeng He, De Cheng, Nannan Wang, Xinbo Gao

Unsupervised visible-infrared person re-identification (USL-VI-ReID) endeavors to retrieve pedestrian images of the same identity from different modalities without annotations. While prior work focuses on establishing cross-modality pseudo-label associations to bridge the modality-gap, they ignore maintaining the instance-level homogeneous and heterogeneous consistency between the feature space and the pseudo-label space, resulting in coarse associations. In response, we introduce a Modality-Unified Label Transfer (MULT) module that simultaneously accounts for both homogeneous and heterogeneous fine-grained instance-level structures, yielding high-quality cross-modality label associations. It models both homogeneous and heterogeneous affinities, leveraging them to quantify the inconsistency between the pseudo-label space and the feature space, subsequently minimizing it. The proposed MULT ensures that the generated pseudo-labels maintain alignment across modalities while upholding structural consistency within intra-modality. Additionally, a straightforward plug-and-play Online Cross-memory Label Refinement (OCLR) module is proposed to further mitigate the side effects of noisy pseudo-labels while simultaneously aligning different modalities, coupled with an Alternative Modality-Invariant Representation Learning (AMIRL) framework. Experiments demonstrate that our proposed method outperforms existing state-of-the-art USL-VI-ReID methods, highlighting the superiority of our MULT in comparison to other cross-modality association methods. Code is available at https://github.com/FranklinLingfeng/code_for_MULT.

无监督可见红外人再识别（USL-VI-ReID）试图在没有注释的情况下从不同的模态检索相同身份的行人图像。虽然先前的工作侧重于建立跨模态伪标签关联来弥合模态差距，但他们忽略了保持特征空间和伪标签空间之间的实例级同质和异构一致性，导致粗糙的关联。作为回应，我们引入了模态统一标签传输（MULT）模块，该模块同时考虑同质和异构细粒度实例级结构，从而产生高质量的跨模态标签关联。它对同构和异构亲缘关系建模，利用它们来量化伪标签空间和特征空间之间的不一致性，随后将其最小化。提议的MULT确保生成的伪标签在模态之间保持对齐，同时在模态内部保持结构一致性。此外，提出了一个简单的即插即用的在线跨内存标签细化（OCLR）模块，以进一步减轻噪声伪标签的副作用，同时对齐不同的模态，再加上替代模态不变表示学习（AMIRL）框架。实验表明，我们提出的方法优于现有的最先进的USL-VI-ReID方法，与其他跨模态关联方法相比，突出了我们的MULT的优越性。代码可从https://github.com/FranklinLingfeng/code_for_MULT获得。

{"title":"Exploring Homogeneous and Heterogeneous Consistent Label Associations for Unsupervised Visible-Infrared Person ReID","authors":"Lingfeng He, De Cheng, Nannan Wang, Xinbo Gao","doi":"10.1007/s11263-024-02322-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02322-1","url":null,"abstract":"Unsupervised visible-infrared person re-identification (USL-VI-ReID) endeavors to retrieve pedestrian images of the same identity from different modalities without annotations. While prior work focuses on establishing cross-modality pseudo-label associations to bridge the modality-gap, they ignore maintaining the instance-level homogeneous and heterogeneous consistency between the feature space and the pseudo-label space, resulting in coarse associations. In response, we introduce a Modality-Unified Label Transfer (MULT) module that simultaneously accounts for both homogeneous and heterogeneous fine-grained instance-level structures, yielding high-quality cross-modality label associations. It models both homogeneous and heterogeneous affinities, leveraging them to quantify the inconsistency between the pseudo-label space and the feature space, subsequently minimizing it. The proposed MULT ensures that the generated pseudo-labels maintain alignment across modalities while upholding structural consistency within intra-modality. Additionally, a straightforward plug-and-play Online Cross-memory Label Refinement (OCLR) module is proposed to further mitigate the side effects of noisy pseudo-labels while simultaneously aligning different modalities, coupled with an Alternative Modality-Invariant Representation Learning (AMIRL) framework. Experiments demonstrate that our proposed method outperforms existing state-of-the-art USL-VI-ReID methods, highlighting the superiority of our MULT in comparison to other cross-modality association methods. Code is available at https://github.com/FranklinLingfeng/code_for_MULT.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"25 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142886995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SLIDE: A Unified Mesh and Texture Generation Framework with Enhanced Geometric Control and Multi-view Consistency 一个统一的网格和纹理生成框架，增强几何控制和多视图一致性

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-23 DOI: 10.1007/s11263-024-02326-x

Jinyi Wang, Zhaoyang Lyu, Ben Fei, Jiangchao Yao, Ya Zhang, Bo Dai, Dahua Lin, Ying He, Yanfeng Wang

The generation of textured mesh is crucial for computer graphics and virtual content creation. However, current generative models often struggle with challenges such as irregular mesh structures and inconsistencies in multi-view textures. In this study, we present a unified framework for both geometry generation and texture generation, utilizing a novel sparse latent point diffusion model that specifically addresses the geometric aspects of models. Our approach employs point clouds as an efficient intermediate representation, encoding them into sparse latent points with semantically meaningful features for precise geometric control. While the sparse latent points facilitate a high-level control over the geometry, shaping the overall structure and fine details of the meshes, this control does not extend to textures. To address this, we propose a separate texture generation process that integrates multi-view priors post-geometry generation, effectively resolving the issue of multi-view texture inconsistency. This process ensures the production of coherent and high-quality textures that complement the precisely generated meshes, thereby creating visually appealing and detailed models. Our framework distinctively separates the control mechanisms for geometry and texture, leading to significant improvements in the generation of complex, textured 3D content. Evaluations on the ShapeNet dataset for geometry and the Objaverse dataset for textures demonstrate that our model surpasses existing methods in terms of geometric quality, control, and the generation of coherent, high-quality textures.

纹理网格的生成对于计算机图形学和虚拟内容的创建至关重要。然而，当前的生成模型经常面临诸如不规则网格结构和多视图纹理不一致等挑战。在这项研究中，我们提出了一个统一的框架，用于几何生成和纹理生成，利用一个新的稀疏潜点扩散模型，专门解决模型的几何方面。我们的方法采用点云作为有效的中间表示，将其编码为具有语义有意义特征的稀疏潜在点，以实现精确的几何控制。虽然稀疏的隐点有助于对几何结构的高级控制，塑造网格的整体结构和精细细节，但这种控制不能扩展到纹理。为了解决这个问题，我们提出了一个单独的纹理生成过程，该过程集成了多视图先验后几何生成，有效地解决了多视图纹理不一致的问题。这个过程确保生产连贯和高质量的纹理，以补充精确生成的网格，从而创建视觉上吸引人的和详细的模型。我们的框架独特地分离了几何和纹理的控制机制，从而显著改善了复杂的纹理3D内容的生成。对ShapeNet的几何数据集和Objaverse的纹理数据集的评估表明，我们的模型在几何质量、控制和生成连贯、高质量的纹理方面超越了现有的方法。

{"title":"SLIDE: A Unified Mesh and Texture Generation Framework with Enhanced Geometric Control and Multi-view Consistency","authors":"Jinyi Wang, Zhaoyang Lyu, Ben Fei, Jiangchao Yao, Ya Zhang, Bo Dai, Dahua Lin, Ying He, Yanfeng Wang","doi":"10.1007/s11263-024-02326-x","DOIUrl":"https://doi.org/10.1007/s11263-024-02326-x","url":null,"abstract":"The generation of textured mesh is crucial for computer graphics and virtual content creation. However, current generative models often struggle with challenges such as irregular mesh structures and inconsistencies in multi-view textures. In this study, we present a unified framework for both geometry generation and texture generation, utilizing a novel sparse latent point diffusion model that specifically addresses the geometric aspects of models. Our approach employs point clouds as an efficient intermediate representation, encoding them into sparse latent points with semantically meaningful features for precise geometric control. While the sparse latent points facilitate a high-level control over the geometry, shaping the overall structure and fine details of the meshes, this control does not extend to textures. To address this, we propose a separate texture generation process that integrates multi-view priors post-geometry generation, effectively resolving the issue of multi-view texture inconsistency. This process ensures the production of coherent and high-quality textures that complement the precisely generated meshes, thereby creating visually appealing and detailed models. Our framework distinctively separates the control mechanisms for geometry and texture, leading to significant improvements in the generation of complex, textured 3D content. Evaluations on the ShapeNet dataset for geometry and the Objaverse dataset for textures demonstrate that our model surpasses existing methods in terms of geometric quality, control, and the generation of coherent, high-quality textures.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"64 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142874200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FusionBooster: A Unified Image Fusion Boosting Paradigm FusionBooster：一个统一的图像融合增强范例

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-23 DOI: 10.1007/s11263-024-02266-6

Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Josef Kittler

In recent years, numerous ideas have emerged for designing a mutually reinforcing mechanism or extra stages for the image fusion task, ignoring the inevitable gaps between different vision tasks and the computational burden. We argue that there is a scope to improve the fusion performance with the help of the FusionBooster, a model specifically designed for fusion tasks. In particular, our booster is based on the divide-and-conquer strategy controlled by an information probe. The booster is composed of three building blocks: the probe units, the booster layer, and the assembling module. Given the result produced by a backbone method, the probe units assess the fused image and divide the results according to their information content. This is instrumental in identifying missing information, as a step to its recovery. The recovery of the degraded components along with the fusion guidance are the role of the booster layer. Lastly, the assembling module is responsible for piecing these advanced components together to deliver the output. We use concise reconstruction loss functions in conjunction with lightweight autoencoder models to formulate the learning task, with marginal computational complexity increase. The experimental results obtained in various fusion missions, as well as downstream detection tasks, consistently demonstrate that the proposed FusionBooster significantly improves the performance. Our code will be publicly available at https://github.com/AWCXV/FusionBooster.

近年来，出现了许多为图像融合任务设计相互增强机制或额外阶段的想法，忽略了不同视觉任务之间不可避免的差距和计算负担。我们认为，在FusionBooster（一个专门为融合任务设计的模型）的帮助下，有一个改善融合性能的范围。特别是，我们的助推器是基于信息探针控制的分而治之策略。该助推器由三个构件组成：探头单元、助推器层和装配模块。根据骨干方法产生的结果，探针单元对融合后的图像进行评估，并根据其信息含量对结果进行划分。这有助于识别丢失的信息，作为恢复信息的一步。助推层的作用是对降解部件的回收和融合制导。最后，组装模块负责将这些高级组件拼接在一起以提供输出。我们使用简洁的重建损失函数结合轻量级自编码器模型来制定学习任务，边际计算复杂度增加。在各种聚变任务以及下游检测任务中获得的实验结果一致表明，所提出的FusionBooster显著提高了性能。我们的代码将在https://github.com/AWCXV/FusionBooster上公开提供。

{"title":"FusionBooster: A Unified Image Fusion Boosting Paradigm","authors":"Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Josef Kittler","doi":"10.1007/s11263-024-02266-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02266-6","url":null,"abstract":"In recent years, numerous ideas have emerged for designing a mutually reinforcing mechanism or extra stages for the image fusion task, ignoring the inevitable gaps between different vision tasks and the computational burden. We argue that there is a scope to improve the fusion performance with the help of the FusionBooster, a model specifically designed for fusion tasks. In particular, our booster is based on the divide-and-conquer strategy controlled by an information probe. The booster is composed of three building blocks: the probe units, the booster layer, and the assembling module. Given the result produced by a backbone method, the probe units assess the fused image and divide the results according to their information content. This is instrumental in identifying missing information, as a step to its recovery. The recovery of the degraded components along with the fusion guidance are the role of the booster layer. Lastly, the assembling module is responsible for piecing these advanced components together to deliver the output. We use concise reconstruction loss functions in conjunction with lightweight autoencoder models to formulate the learning task, with marginal computational complexity increase. The experimental results obtained in various fusion missions, as well as downstream detection tasks, consistently demonstrate that the proposed FusionBooster significantly improves the performance. Our code will be publicly available at https://github.com/AWCXV/FusionBooster.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"24 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142874202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models LaVie：具有级联潜在扩散模型的高质量视频生成

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-23 DOI: 10.1007/s11263-024-02295-1

Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.

本研究旨在利用预训练的文本到图像（T2I）模型为基础，学习高质量的文本到视频（T2V）生成模型。这是一项非常理想但具有挑战性的任务，同时(a)完成视觉逼真和时间连贯视频的合成，同时(b)保留预训练T2I模型的强大创意生成性质。为此，我们提出了LaVie，这是一个集成的视频生成框架，用于级联视频潜在扩散模型，包括基本T2V模型、时间插值模型和视频超分辨率模型。我们的主要见解有两个方面：(1)我们揭示了简单的时间自关注的结合，加上旋转位置编码，充分捕捉了视频数据中固有的时间相关性。(2)此外，我们验证了图像-视频联合微调过程在产生高质量和创造性成果方面起着关键作用。为了提高LaVie的性能，我们提供了一个全面而多样化的视频数据集，名为Vimeo25M，由2500万文本视频对组成，优先考虑质量、多样性和审美吸引力。大量的实验表明，LaVie在定量和定性上都达到了最先进的性能。此外，我们展示了预训练LaVie模型在各种长视频生成和个性化视频合成应用中的多功能性。项目页面：https://github.com/Vchitect/LaVie/。

{"title":"LaVie: High-Quality Video Generation with Cascaded Latent Diffusion Models","authors":"Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu","doi":"10.1007/s11263-024-02295-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02295-1","url":null,"abstract":"This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously (a) accomplish the synthesis of visually realistic and temporally coherent videos while (b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: (1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. (2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications. Project page: https://github.com/Vchitect/LaVie/.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"20 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142874201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts AutoStory：用最少的人力产生不同的故事图像

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2024-12-23 DOI: 10.1007/s11263-024-02309-y

Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen

Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring the users to provide per-image control conditions such as sketches. However, these simplifications render these methods incompetent for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images, with minimal human interactions. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches, and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module to transform simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves the image quality but also allows easy and intuitive user interactions. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images. This allows our method to obtain consistent story visualization even when only texts are provided as input. Both qualitative and quantitative experiments demonstrate the superiority of our method.

故事可视化旨在生成一系列与文本描述的故事相匹配的图像，要求生成的图像质量高、与文本描述一致、人物身份一致。考虑到故事可视化的复杂性，现有的方法通过只考虑几个特定的角色和场景，或者要求用户提供每个图像的控制条件（如草图），大大简化了问题。然而，这些简化使得这些方法不适合实际应用。为此，我们提出了一个自动化的故事可视化系统，该系统可以有效地生成多样化、高质量和一致的故事图像集，而人工交互最少。具体来说，我们利用大型语言模型的理解和规划能力来进行布局规划，然后利用大规模的文本到图像模型来生成基于布局的复杂故事图像。我们的经验发现，稀疏控制条件（如边界框）适用于布局规划，而密集控制条件（如草图和关键点）适用于生成高质量的图像内容。为了获得两全其美，我们设计了一个密集的条件生成模块，将简单的边界框布局转换为最终图像生成的草图或关键点控制条件，不仅提高了图像质量，而且允许简单直观的用户交互。此外，我们提出了一种简单而有效的方法来生成多视图一致的字符图像，消除了对人工采集或绘制字符图像的依赖。这允许我们的方法获得一致的故事可视化，即使只提供文本作为输入。定性和定量实验均证明了该方法的优越性。

{"title":"AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts","authors":"Wen Wang, Canyu Zhao, Hao Chen, Zhekai Chen, Kecheng Zheng, Chunhua Shen","doi":"10.1007/s11263-024-02309-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02309-y","url":null,"abstract":"Story visualization aims to generate a series of images that match the story described in texts, and it requires the generated images to satisfy high quality, alignment with the text description, and consistency in character identities. Given the complexity of story visualization, existing methods drastically simplify the problem by considering only a few specific characters and scenarios, or requiring the users to provide per-image control conditions such as sketches. However, these simplifications render these methods incompetent for real applications. To this end, we propose an automated story visualization system that can effectively generate diverse, high-quality, and consistent sets of story images, with minimal human interactions. Specifically, we utilize the comprehension and planning capabilities of large language models for layout planning, and then leverage large-scale text-to-image models to generate sophisticated story images based on the layout. We empirically find that sparse control conditions, such as bounding boxes, are suitable for layout planning, while dense control conditions, e.g., sketches, and keypoints, are suitable for generating high-quality image content. To obtain the best of both worlds, we devise a dense condition generation module to transform simple bounding box layouts into sketch or keypoint control conditions for final image generation, which not only improves the image quality but also allows easy and intuitive user interactions. In addition, we propose a simple yet effective method to generate multi-view consistent character images, eliminating the reliance on human labor to collect or draw character images. This allows our method to obtain consistent story visualization even when only texts are provided as input. Both qualitative and quantitative experiments demonstrate the superiority of our method.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"32 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142874204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0