Text-to-image models are increasingly applied to human image generation, leveraging multimodal information under multiple conditions to produce high-quality human images. Despite their ability to generate detailed images, these models often struggle to maintain perceptual consistency across multiple viewpoints. To address this limitation, we propose Multi-View Human Diffusion (MVHDiff), a novel framework that integrates 3D human model priors and text prompts to generate high-quality, multi-view-consistent human images. MVHDiff separately acquires textual descriptions of human appearance and pose, as well as spatial information about the subject's orientation relative to the camera. A perceptual fusion module then aligns these text features with the visual features extracted from the human image, enabling joint learning of prior information and image features. In addition, MVHDiff fine-tunes both the appearance descriptions and the viewpoint-related textual inputs, enabling precise text-based control over human attributes while preserving semantic consistency across different spatial viewpoints. Experimental results demonstrate that MVHDiff significantly outperforms existing methods in generating text-guided human attributes with consistent multi-view representations, offering a robust solution for high-quality, text-driven human image generation.
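To make the described fusion step more concrete, the sketch below illustrates one plausible way a perceptual fusion module could align text features (appearance/pose and viewpoint descriptions) with visual features via cross-attention. This is a minimal illustration, not the authors' implementation; all class names, dimensions, and design choices here are assumptions.

```python
# Minimal sketch (assumed, not the paper's actual module) of a cross-attention
# block that injects text-prompt features into image features.
import torch
import torch.nn as nn


class PerceptualFusionBlock(nn.Module):
    def __init__(self, img_dim: int = 768, txt_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project text features (appearance + viewpoint descriptions) into the
        # image feature space before attention.
        self.txt_proj = nn.Linear(txt_dim, img_dim)
        self.cross_attn = nn.MultiheadAttention(img_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)
        self.ffn = nn.Sequential(
            nn.Linear(img_dim, 4 * img_dim),
            nn.GELU(),
            nn.Linear(4 * img_dim, img_dim),
        )

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N_img_tokens, img_dim) visual tokens from the human image
        # txt_feats: (B, N_txt_tokens, txt_dim) encoded appearance/viewpoint text
        txt = self.txt_proj(txt_feats)
        # Image tokens query the text tokens, so textual prior information is
        # fused into the visual representation.
        fused, _ = self.cross_attn(query=img_feats, key=txt, value=txt)
        x = self.norm(img_feats + fused)
        return x + self.ffn(x)


# Usage sketch with random tensors standing in for real encoder outputs.
if __name__ == "__main__":
    block = PerceptualFusionBlock()
    img = torch.randn(2, 256, 768)  # e.g. 16x16 latent patches
    txt = torch.randn(2, 77, 768)   # e.g. CLIP-style text tokens
    out = block(img, txt)
    print(out.shape)  # torch.Size([2, 256, 768])
```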
