Hand1000: Generating Realistic Hands from Text with Only 1,000 Images
Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao
arXiv:2408.15461 · arXiv - CS - Multimedia · 2024-08-28
Text-to-image generation models have achieved remarkable advancements in
recent years, aiming to produce realistic images from textual descriptions.
However, these models often struggle with generating anatomically accurate
representations of human hands. The resulting images frequently exhibit issues
such as incorrect numbers of fingers, unnatural twisting or interlacing of
fingers, or blurred and indistinct hands. These issues stem from the inherent
complexity of hand structures and the difficulty in aligning textual
descriptions with precise visual depictions of hands. To address these
challenges, we propose a novel approach named Hand1000 that enables the
generation of realistic hand images with a target gesture using only 1,000
training samples. The training of Hand1000 is divided into three stages. The
first stage enhances the model's understanding of hand anatomy by using a
pre-trained hand gesture recognition model to extract a gesture representation.
The second stage further optimizes the text embedding by incorporating the
extracted gesture representation, improving alignment between the textual
descriptions and the generated hand images. The third stage
utilizes the optimized embedding to fine-tune the Stable Diffusion model to
generate realistic hand images. In addition, we construct the first publicly
available dataset specifically designed for text-to-hand image generation.
Building on an existing hand gesture recognition dataset, we adopt advanced image
captioning models and LLaMA3 to generate high-quality textual descriptions
enriched with detailed gesture information. Extensive experiments demonstrate
that Hand1000 significantly outperforms existing models in producing
anatomically correct hand images while faithfully representing other details in
the text, such as faces, clothing, and colors.
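The second stage described above injects the extracted gesture representation into the text embedding before fine-tuning. The abstract does not specify the fusion mechanism, so the following is a minimal sketch of one plausible approach: project the gesture feature into the text-embedding space and add it to every token embedding. All dimensions (CLIP-style 77×768 text embeddings, a 512-dimensional gesture feature) and the learned projection matrix are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def fuse_gesture_into_text(text_emb, gesture_feat, proj):
    """Fuse a gesture feature into a sequence of token embeddings.

    text_emb:     (seq_len, text_dim) token embeddings from a text encoder
    gesture_feat: (gesture_dim,) feature from a gesture recognition model
    proj:         (gesture_dim, text_dim) learned projection (random here)
    """
    # Project the gesture feature into the text-embedding space ...
    g = gesture_feat @ proj                 # (text_dim,)
    # ... then broadcast-add it to every token embedding.
    return text_emb + g[None, :]            # (seq_len, text_dim)

rng = np.random.default_rng(0)
text_emb = rng.standard_normal((77, 768))       # CLIP-style token embeddings
gesture = rng.standard_normal(512)              # gesture recognizer output
proj = rng.standard_normal((512, 768)) * 0.01   # small random projection
fused = fuse_gesture_into_text(text_emb, gesture, proj)
print(fused.shape)  # (77, 768)
```

The fused embedding then plays the role of the ordinary text conditioning when fine-tuning the diffusion model, so the gesture cue reaches every cross-attention layer without changing the model architecture.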