Interpreting Art by Leveraging Pre-Trained Models

Niklas Penzel, J. Denzler
{"title":"利用预先训练的模型来解读艺术","authors":"Niklas Penzel, J. Denzler","doi":"10.23919/MVA57639.2023.10216010","DOIUrl":null,"url":null,"abstract":"In many domains, so-called foundation models were recently proposed. These models are trained on immense amounts of data resulting in impressive performances on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models1. Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Interpreting Art by Leveraging Pre-Trained Models\",\"authors\":\"Niklas Penzel, J. Denzler\",\"doi\":\"10.23919/MVA57639.2023.10216010\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In many domains, so-called foundation models were recently proposed. These models are trained on immense amounts of data resulting in impressive performances on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models1. 
Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.\",\"PeriodicalId\":338734,\"journal\":{\"name\":\"2023 18th International Conference on Machine Vision and Applications (MVA)\",\"volume\":\"92 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 18th International Conference on Machine Vision and Applications (MVA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/MVA57639.2023.10216010\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 18th International Conference on Machine Vision and Applications (MVA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/MVA57639.2023.10216010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

In many domains, so-called foundation models have recently been proposed. These models are trained on immense amounts of data, resulting in impressive performance on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language-vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce the compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models. Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.
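The abstract's second contribution, combining and fine-tuning CLIP and GPT-2, leaves the architecture unspecified. Below is a minimal sketch of one established way to couple the two models, assuming a ClipCap-style design in which the pooled CLIP image embedding is projected into a short learned "prefix" of GPT-2 token embeddings. The checkpoint names, prefix length, and linear projection are illustrative assumptions, not the paper's confirmed setup.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, GPT2LMHeadModel

class ClipGPT2Interpreter(nn.Module):
    """Hypothetical bridge between a CLIP vision encoder and GPT-2.

    The pooled CLIP image feature is projected into a short sequence of
    GPT-2 embeddings (a "prefix") that conditions text generation.
    """

    def __init__(self, prefix_len: int = 10):
        super().__init__()
        self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        clip_dim = self.clip.config.hidden_size   # 768 for ViT-B/32
        gpt_dim = self.gpt2.config.n_embd         # 768 for GPT-2 small
        self.prefix_len = prefix_len
        # Learned map from one image embedding to `prefix_len` pseudo-tokens.
        self.project = nn.Linear(clip_dim, prefix_len * gpt_dim)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
        # (B, clip_dim) pooled image representation from the CLIP vision tower.
        image_feat = self.clip(pixel_values=pixel_values).pooler_output
        prefix = self.project(image_feat).view(image_feat.size(0), self.prefix_len, -1)
        # GPT-2 token embeddings for the (partial) interpretation text.
        token_embeds = self.gpt2.transformer.wte(input_ids)
        inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
        # (B, prefix_len + T, vocab) next-token logits.
        return self.gpt2(inputs_embeds=inputs_embeds).logits
```

For training, the language-modeling loss would typically be restricted to the interpretation tokens, e.g. by setting the label entries over the prefix positions to -100 so that the Hugging Face cross-entropy loss ignores them.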
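The abstract also reports improvements in visual-text alignment. A common proxy for alignment is the cosine similarity between CLIP's image and text embeddings (the idea behind CLIPScore); the sketch below computes it with a public ViT-B/32 checkpoint, chosen here only for illustration since the abstract does not name the CLIP variant used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public checkpoint chosen for illustration; the paper's exact variant is unspecified.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def alignment_score(image: Image.Image, text: str) -> float:
    """Cosine similarity of CLIP image/text embeddings, a CLIPScore-style
    proxy for visual-text alignment. Note that CLIP truncates text at 77
    tokens, so long interpretations are only partially scored."""
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```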