Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
{"title":"通过基于 CLIP 的直接优化重新审视图像字幕培训范式","authors":"Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara","doi":"arxiv-2408.14547","DOIUrl":null,"url":null,"abstract":"The conventional training approach for image captioning involves pre-training\na network using teacher forcing and subsequent fine-tuning with Self-Critical\nSequence Training to maximize hand-crafted captioning metrics. However, when\nattempting to optimize modern and higher-quality metrics like CLIP-Score and\nPAC-Score, this training method often encounters instability and fails to\nacquire the genuine descriptive capabilities needed to produce fluent and\ninformative captions. In this paper, we propose a new training paradigm termed\nDirect CLIP-Based Optimization (DiCO). Our approach jointly learns and\noptimizes a reward model that is distilled from a learnable captioning\nevaluator with high human correlation. This is done by solving a weighted\nclassification problem directly inside the captioner. At the same time, DiCO\nprevents divergence from the original model, ensuring that fluency is\nmaintained. DiCO not only exhibits improved stability and enhanced quality in\nthe generated captions but also aligns more closely with human preferences\ncompared to existing methods, especially in modern metrics. Additionally, it\nmaintains competitive performance in traditional metrics. Our source code and\ntrained models are publicly available at https://github.com/aimagelab/DiCO.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"67 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization\",\"authors\":\"Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara\",\"doi\":\"arxiv-2408.14547\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The conventional training approach for image captioning involves pre-training\\na network using teacher forcing and subsequent fine-tuning with Self-Critical\\nSequence Training to maximize hand-crafted captioning metrics. However, when\\nattempting to optimize modern and higher-quality metrics like CLIP-Score and\\nPAC-Score, this training method often encounters instability and fails to\\nacquire the genuine descriptive capabilities needed to produce fluent and\\ninformative captions. In this paper, we propose a new training paradigm termed\\nDirect CLIP-Based Optimization (DiCO). Our approach jointly learns and\\noptimizes a reward model that is distilled from a learnable captioning\\nevaluator with high human correlation. This is done by solving a weighted\\nclassification problem directly inside the captioner. At the same time, DiCO\\nprevents divergence from the original model, ensuring that fluency is\\nmaintained. DiCO not only exhibits improved stability and enhanced quality in\\nthe generated captions but also aligns more closely with human preferences\\ncompared to existing methods, especially in modern metrics. Additionally, it\\nmaintains competitive performance in traditional metrics. 
Our source code and\\ntrained models are publicly available at https://github.com/aimagelab/DiCO.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"67 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.14547\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.14547","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
The conventional training approach for image captioning involves pre-training
a network using teacher forcing and subsequent fine-tuning with Self-Critical
Sequence Training to maximize hand-crafted captioning metrics. However, when
attempting to optimize modern and higher-quality metrics like CLIP-Score and
PAC-Score, this training method often encounters instability and fails to
acquire the genuine descriptive capabilities needed to produce fluent and
informative captions. In this paper, we propose a new training paradigm termed
Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and
optimizes a reward model that is distilled from a learnable captioning
evaluator with high human correlation. This is done by solving a weighted
classification problem directly inside the captioner. At the same time, DiCO
prevents divergence from the original model, ensuring that fluency is
maintained. DiCO not only exhibits improved stability and enhanced quality in
the generated captions but also aligns more closely with human preferences
compared to existing methods, especially in modern metrics. Additionally, it
maintains competitive performance in traditional metrics. Our source code and
trained models are publicly available at https://github.com/aimagelab/DiCO.
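As a rough illustration of the idea described in the abstract (not the authors' exact objective; the linked repository contains the real implementation), the sketch below combines a standard CLIP-Score with a weighted, DPO-style pairwise classification loss computed against a frozen reference captioner, which is one plausible way to "solve a weighted classification problem directly inside the captioner" while preventing divergence from the original model. The function names, the `reward_gap` weighting, and the `beta` coefficient are illustrative assumptions.

```python
# Hedged sketch of a DiCO-style objective; an assumption for illustration only.
import torch
import torch.nn.functional as F


def clip_score(image_emb: torch.Tensor, text_emb: torch.Tensor, w: float = 2.5) -> torch.Tensor:
    """CLIP-Score (Hessel et al., 2021): w * max(cosine(image, caption), 0)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return w * (image_emb * text_emb).sum(dim=-1).clamp(min=0.0)


def dico_like_loss(
    logp_pos: torch.Tensor,      # policy log-prob of the higher-scored caption, shape (B,)
    logp_neg: torch.Tensor,      # policy log-prob of the lower-scored caption, shape (B,)
    ref_logp_pos: torch.Tensor,  # same captions scored by the frozen reference captioner
    ref_logp_neg: torch.Tensor,
    reward_gap: torch.Tensor,    # evaluator score difference, used as a weight (hypothetical choice)
    beta: float = 0.1,
) -> torch.Tensor:
    # Implicit reward = beta * log-ratio against the reference model; anchoring to the
    # reference is what keeps the fine-tuned captioner fluent.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # Weighted binary classification: softplus(-margin) == -log(sigmoid(margin)),
    # so the captioner is pushed to rank captions the same way the CLIP-based evaluator does.
    return (reward_gap * F.softplus(-margin)).mean()


if __name__ == "__main__":
    B = 4  # toy batch of caption pairs with random log-probs and weights
    loss = dico_like_loss(
        logp_pos=torch.randn(B), logp_neg=torch.randn(B),
        ref_logp_pos=torch.randn(B), ref_logp_neg=torch.randn(B),
        reward_gap=torch.rand(B),
    )
    print(loss.item())
```

In such a setup, caption pairs would come from sampling the captioner itself and ranking the samples with the CLIP-based evaluator; unlike Self-Critical Sequence Training, no high-variance policy-gradient estimate of the metric is needed, which is consistent with the stability improvement the abstract claims.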