Title: Control With Style: Style Embedding-Based Variational Autoencoder for Controlled Stylized Caption Generation Framework
Authors: Dhruv Sharma; Chhavi Dhiman; Dinesh Kumar
DOI: 10.1109/TCDS.2024.3405573
Journal: IEEE Transactions on Cognitive and Developmental Systems, vol. 16, no. 6, pp. 2032-2042 (JCR Q1, Computer Science, Artificial Intelligence; impact factor 5.0)
Publication date: 2024-03-30
URL: https://ieeexplore.ieee.org/document/10542089/
Control With Style: Style Embedding-Based Variational Autoencoder for Controlled Stylized Caption Generation Framework
Automatic image captioning is a computationally intensive and structurally complex task that describes the contents of an image in the form of a natural-language sentence. Methods developed in the recent past have focused mainly on describing the factual content of images, thereby ignoring the different emotions and styles (romantic, humorous, angry, etc.) associated with them. To overcome this, a few works have incorporated style-based caption generation that captures the variability in the generated descriptions. This article presents a style embedding-based variational autoencoder for controlled stylized caption generation framework (RFCG+SE-VAE-CSCG). It generates controlled, text-based stylized descriptions of images. It works in two phases: 1) refined factual caption generation (RFCG), and 2) SE-VAE-CSCG. The former defines an encoder–decoder model for the generation of refined factual captions, whereas the latter presents an SE-VAE for controlled stylized caption generation. The overall proposed framework generates style-based descriptions of images by leveraging bags of captions (BoCs). Moreover, with the use of a controlled text generation model, the proposed work efficiently learns disentangled representations and generates realistic stylized descriptions of images. Experiments on MSCOCO, Flickr30K, and FlickrStyle10K provide state-of-the-art results for both refined and style-based caption generation, supported by an ablation study.
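The core mechanism the abstract describes, a VAE whose decoder is conditioned on a learned style embedding to control the generated style, can be sketched in a few lines. The dimensions, weight matrices, and function names below are illustrative assumptions for a minimal numpy toy, not the paper's actual architecture or training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not specify these.
FEAT, LATENT, N_STYLES = 32, 16, 3  # e.g. styles: factual, romantic, humorous

# One learned embedding vector per style (random here purely for illustration).
style_emb = rng.normal(size=(N_STYLES, FEAT))

# Toy encoder/decoder weights (in practice these would be trained).
W_mu = rng.normal(size=(FEAT, LATENT))
W_logvar = rng.normal(size=(FEAT, LATENT))
W_dec = rng.normal(size=(LATENT + FEAT, FEAT))

def reparameterize(mu, logvar):
    """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def stylize(caption_feat, style_id):
    """Encode a caption feature, then decode conditioned on a style embedding."""
    mu, logvar = caption_feat @ W_mu, caption_feat @ W_logvar
    z = reparameterize(mu, logvar)
    # Controlled generation: concatenate the chosen style embedding with z,
    # so the decoder's output depends on both content (z) and style.
    cond = np.concatenate([z, style_emb[style_id]])
    return np.tanh(cond @ W_dec)

out = stylize(rng.normal(size=FEAT), style_id=1)
print(out.shape)  # (32,)
```

Swapping `style_id` while holding the input fixed changes only the style conditioning, which is the sense in which generation is "controlled": the disentangled latent carries content while the style embedding steers the rendering.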
Journal introduction:
The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.