Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing.

Yipo Huang, Leida Li, Pengfei Chen, Haoning Wu, Weisi Lin, Guangming Shi
IEEE Transactions on Pattern Analysis and Machine Intelligence. Journal Article. Published 2024-11-06. DOI: 10.1109/TPAMI.2024.3492259 (https://doi.org/10.1109/TPAMI.2024.3492259)

Abstract

In Image Aesthetics Computing (IAC), most prior methods rely on off-the-shelf backbones pre-trained on the large-scale ImageNet database. Although these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which can lead to suboptimal performance. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework that aims to provide an alternative to ImageNet-based pre-training for IAC. The framework has two main components. (1) We build a multi-attribute image description database with human feedback, leveraging the image understanding capability of a multi-modality large language model to generate rich aesthetic descriptions. (2) To better adapt models to aesthetic computing tasks, we integrate image-based visual features with attribute-based text features and map the integrated features into different embedding spaces, on which multi-attribute contrastive learning is performed to obtain a more comprehensive aesthetic representation. To alleviate the distribution shift encountered when moving from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss that restrains content information and improves model generalization. Extensive experiments demonstrate that the proposed framework sets a new state of the art on IAC tasks. The code, database and pre-trained weights will be available at https://github.com/yipoh/AesNet.
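
The abstract does not give the loss formulation, so the following is only a minimal sketch of what contrastive learning over several attribute-specific embedding spaces could look like, not the authors' implementation. It assumes a standard symmetric InfoNCE objective, one linear projection per attribute, and a simple average over attributes. The attribute names, the function names, the temperature value and the projection matrices are all hypothetical, and the sketch omits both the feature integration step and the semantic affinity loss.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is assumed to be a matched pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def multi_attribute_contrastive_loss(image_feat, text_feats, image_proj,
                                     text_proj, temperature=0.07):
    """Average InfoNCE over hypothetical attribute-specific embedding spaces.

    image_feat: (batch, d_img) visual features.
    text_feats: dict attribute -> (batch, d_txt) text features for that attribute.
    image_proj, text_proj: dict attribute -> projection matrix into that
        attribute's embedding space (an assumption of this sketch).
    """
    per_attribute = {}
    for name, txt in text_feats.items():
        img_emb = image_feat @ image_proj[name]
        txt_emb = txt @ text_proj[name]
        per_attribute[name] = float(info_nce(img_emb, txt_emb, temperature))
    total = float(np.mean(list(per_attribute.values())))
    return total, per_attribute


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, d_feat, d_emb = 8, 16, 4
    attributes = ["color", "composition", "lighting"]  # hypothetical names
    image_feat = rng.normal(size=(batch, d_feat))
    text_feats = {a: rng.normal(size=(batch, d_feat)) for a in attributes}
    image_proj = {a: rng.normal(size=(d_feat, d_emb)) for a in attributes}
    text_proj = {a: rng.normal(size=(d_feat, d_emb)) for a in attributes}
    total, per_attribute = multi_attribute_contrastive_loss(
        image_feat, text_feats, image_proj, text_proj)
    print(total, per_attribute)
```

Keeping a separate projection per attribute means an image can sit close to its description in one space and far from it in another, which is one plausible reading of "different embedding spaces"; the released code, once available, should be treated as the reference.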
