POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
arXiv:2409.04828 · arXiv - CS - Multimedia · 2024-09-07
{"title":"要点:用经济实惠的策略改进您的视觉语言模式","authors":"Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou","doi":"arxiv-2409.04828","DOIUrl":null,"url":null,"abstract":"In recent years, vision-language models have made significant strides,\nexcelling in tasks like optical character recognition and geometric\nproblem-solving. However, several critical issues remain: 1) Proprietary models\noften lack transparency about their architectures, while open-source models\nneed more detailed ablations of their training strategies. 2) Pre-training data\nin open-source works is under-explored, with datasets added empirically, making\nthe process cumbersome. 3) Fine-tuning often focuses on adding datasets,\nleading to diminishing returns. To address these issues, we propose the\nfollowing contributions: 1) We trained a robust baseline model using the latest\nadvancements in vision-language models, introducing effective improvements and\nconducting comprehensive ablation and validation for each technique. 2)\nInspired by recent work on large language models, we filtered pre-training data\nusing perplexity, selecting the lowest perplexity data for training. This\napproach allowed us to train on a curated 1M dataset, achieving competitive\nperformance. 3) During visual instruction tuning, we used model soup on\ndifferent datasets when adding more datasets yielded marginal improvements.\nThese innovations resulted in a 9B parameter model that performs competitively\nwith state-of-the-art models. Our strategies are efficient and lightweight,\nmaking them easily adoptable by the community.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"POINTS: Improving Your Vision-language Model with Affordable Strategies\",\"authors\":\"Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou\",\"doi\":\"arxiv-2409.04828\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, vision-language models have made significant strides,\\nexcelling in tasks like optical character recognition and geometric\\nproblem-solving. However, several critical issues remain: 1) Proprietary models\\noften lack transparency about their architectures, while open-source models\\nneed more detailed ablations of their training strategies. 2) Pre-training data\\nin open-source works is under-explored, with datasets added empirically, making\\nthe process cumbersome. 3) Fine-tuning often focuses on adding datasets,\\nleading to diminishing returns. To address these issues, we propose the\\nfollowing contributions: 1) We trained a robust baseline model using the latest\\nadvancements in vision-language models, introducing effective improvements and\\nconducting comprehensive ablation and validation for each technique. 2)\\nInspired by recent work on large language models, we filtered pre-training data\\nusing perplexity, selecting the lowest perplexity data for training. This\\napproach allowed us to train on a curated 1M dataset, achieving competitive\\nperformance. 3) During visual instruction tuning, we used model soup on\\ndifferent datasets when adding more datasets yielded marginal improvements.\\nThese innovations resulted in a 9B parameter model that performs competitively\\nwith state-of-the-art models. 
Our strategies are efficient and lightweight,\\nmaking them easily adoptable by the community.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"15 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.04828\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04828","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
POINTS: Improving Your Vision-language Model with Affordable Strategies
In recent years, vision-language models have made significant strides,
excelling in tasks like optical character recognition and geometric
problem-solving. However, several critical issues remain: 1) Proprietary models
often lack transparency about their architectures, while open-source models
need more detailed ablations of their training strategies. 2) Pre-training data
in open-source works is under-explored, with datasets added empirically, making
the process cumbersome. 3) Fine-tuning often focuses on adding datasets,
leading to diminishing returns. To address these issues, we make the
following contributions: 1) We trained a robust baseline model using the latest
advancements in vision-language models, introducing effective improvements and
conducting comprehensive ablation and validation for each technique. 2)
Inspired by recent work on large language models, we filtered the pre-training
data by perplexity, keeping only the lowest-perplexity samples for training.
This allowed us to train on a curated 1M-sample dataset while achieving
competitive performance (a sketch of the filtering appears after the
abstract). 3) During visual instruction tuning, once adding more datasets
yielded only marginal improvements, we applied model soup, averaging the
weights of models fine-tuned on different dataset selections (also sketched
below).
These innovations resulted in a 9B-parameter model that performs competitively
with state-of-the-art models. Our strategies are efficient and lightweight,
making them easily adoptable by the community.
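A minimal sketch, in Python, of the perplexity-based filtering described in
point 2 of the abstract. The scorer model (gpt2), the truncation length, and
the 1M keep budget are illustrative assumptions, not details taken from the
paper.

```python
# Hypothetical perplexity filter: score each pre-training sample with a small
# causal LM and keep the lowest-perplexity subset. All names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text, model, tokenizer, device="cuda"):
    """Perplexity of `text` under a causal language model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=2048).to(device)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy; exp(loss) is the perplexity.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

def filter_lowest_perplexity(samples, keep=1_000_000, scorer="gpt2"):
    """Keep the `keep` samples whose perplexity is lowest."""
    tokenizer = AutoTokenizer.from_pretrained(scorer)
    model = AutoModelForCausalLM.from_pretrained(scorer).to("cuda").eval()
    scored = [(perplexity(s, model, tokenizer), s) for s in samples]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:keep]]
```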
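And a sketch of the model-soup step from point 3: uniform averaging of the
weights of checkpoints fine-tuned on different instruction-data mixes. The
checkpoint paths are placeholders, and the paper may combine ingredients
differently (e.g., a greedy rather than uniform soup).

```python
# Uniform model soup: average the state dicts of several fine-tuned
# checkpoints that share one architecture, then load the result.
import torch

def uniform_soup(state_dicts):
    """Element-wise mean of parameters across same-architecture checkpoints."""
    soup = {}
    for key, ref in state_dicts[0].items():
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        soup[key] = stacked.mean(dim=0).to(ref.dtype)  # restore original dtype
    return soup

# Usage (paths are hypothetical):
# paths = ["ckpt_mix_a.pt", "ckpt_mix_b.pt", "ckpt_mix_c.pt"]
# soup = uniform_soup([torch.load(p, map_location="cpu") for p in paths])
# model.load_state_dict(soup)
```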