{"title":"TextBoost:通过微调文本编码器实现文本到图像模型的一次性个性化定制","authors":"NaHyeon Park, Kunhee Kim, Hyunjung Shim","doi":"arxiv-2409.08248","DOIUrl":null,"url":null,"abstract":"Recent breakthroughs in text-to-image models have opened up promising\nresearch avenues in personalized image generation, enabling users to create\ndiverse images of a specific subject using natural language prompts. However,\nexisting methods often suffer from performance degradation when given only a\nsingle reference image. They tend to overfit the input, producing highly\nsimilar outputs regardless of the text prompt. This paper addresses the\nchallenge of one-shot personalization by mitigating overfitting, enabling the\ncreation of controllable images through text prompts. Specifically, we propose\na selective fine-tuning strategy that focuses on the text encoder. Furthermore,\nwe introduce three key techniques to enhance personalization performance: (1)\naugmentation tokens to encourage feature disentanglement and alleviate\noverfitting, (2) a knowledge-preservation loss to reduce language drift and\npromote generalizability across diverse prompts, and (3) SNR-weighted sampling\nfor efficient training. Extensive experiments demonstrate that our approach\nefficiently generates high-quality, diverse images using only a single\nreference image while significantly reducing memory and storage requirements.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder\",\"authors\":\"NaHyeon Park, Kunhee Kim, Hyunjung Shim\",\"doi\":\"arxiv-2409.08248\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent breakthroughs in text-to-image models have opened up promising\\nresearch avenues in personalized image generation, enabling users to create\\ndiverse images of a specific subject using natural language prompts. However,\\nexisting methods often suffer from performance degradation when given only a\\nsingle reference image. They tend to overfit the input, producing highly\\nsimilar outputs regardless of the text prompt. This paper addresses the\\nchallenge of one-shot personalization by mitigating overfitting, enabling the\\ncreation of controllable images through text prompts. Specifically, we propose\\na selective fine-tuning strategy that focuses on the text encoder. Furthermore,\\nwe introduce three key techniques to enhance personalization performance: (1)\\naugmentation tokens to encourage feature disentanglement and alleviate\\noverfitting, (2) a knowledge-preservation loss to reduce language drift and\\npromote generalizability across diverse prompts, and (3) SNR-weighted sampling\\nfor efficient training. 
Extensive experiments demonstrate that our approach\\nefficiently generates high-quality, diverse images using only a single\\nreference image while significantly reducing memory and storage requirements.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08248\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08248","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder
Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
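
The abstract outlines enough of the training recipe that a sketch may help fix ideas. The following PyTorch/diffusers sketch is a reconstruction under stated assumptions, not the authors' released code: it freezes the U-Net and VAE and fine-tunes only the text encoder; it adds a knowledge-preservation term against a frozen copy of the encoder (one plausible form of the loss named above); and it samples diffusion timesteps with SNR-derived probabilities (the paper's exact weighting may differ). The augmentation-token technique is omitted. The base model choice and all names (training_step, generic_ids, kp_weight) are illustrative.

import copy

import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, StableDiffusionPipeline

# Base model is an assumption; the abstract does not name one.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
unet, vae, text_encoder = pipe.unet, pipe.vae, pipe.text_encoder
scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Selective fine-tuning: freeze the U-Net and VAE so that only the text
# encoder receives gradients and needs to be stored after training.
unet.requires_grad_(False)
vae.requires_grad_(False)
frozen_text_encoder = copy.deepcopy(text_encoder).requires_grad_(False)
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)

# Per-timestep SNR from the scheduler: SNR(t) = alpha_bar_t / (1 - alpha_bar_t).
alphas_cumprod = scheduler.alphas_cumprod
snr = alphas_cumprod / (1.0 - alphas_cumprod)
# SNR-weighted timestep sampling (assumed form: probabilities proportional to
# the bounded quantity SNR / (1 + SNR); the paper may define this differently).
timestep_probs = snr / (1.0 + snr)
timestep_probs = timestep_probs / timestep_probs.sum()

def training_step(pixel_values, input_ids, generic_ids, kp_weight=0.1):
    """One step: the denoising loss through the frozen U-Net, plus a
    knowledge-preservation term that pulls embeddings of generic prompts
    toward those of the frozen original encoder (limiting language drift)."""
    latents = vae.encode(pixel_values).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.multinomial(timestep_probs, latents.shape[0], replacement=True)
    noisy_latents = scheduler.add_noise(latents, noise, t)

    hidden = text_encoder(input_ids)[0]  # gradients flow only through here
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=hidden).sample
    denoise_loss = F.mse_loss(noise_pred, noise)

    # Knowledge-preservation loss (assumed form: an MSE between the tuned and
    # frozen encoders on prompts unrelated to the personalized subject).
    with torch.no_grad():
        ref = frozen_text_encoder(generic_ids)[0]
    kp_loss = F.mse_loss(text_encoder(generic_ids)[0], ref)

    loss = denoise_loss + kp_weight * kp_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

One consequence of this design, consistent with the abstract's claim about memory and storage: the checkpoint saved per personalized subject is only the text encoder (or a delta of it), not the full U-Net.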