MLLM-FL: Multimodal Large Language Model Assisted Federated Learning on Heterogeneous and Long-tailed Data
Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
arXiv - CS - Artificial Intelligence, 2024-09-09. https://doi.org/arxiv-2409.06067
Abstract
Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. Recent advances in multimodal large language models (MLLMs), such as GPT-4V and LLaVA, have demonstrated exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering. Motivated by these advances, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-FL), which employs powerful MLLMs at the server end to address the challenges of data heterogeneity and long-tailed distributions. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing extensive, yet previously underexploited, open-source data accessible from websites, together with powerful server-side computational resources. Hence, MLLM-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. First, prior to local training on clients' local datasets, we conduct global visual-text pretraining of the model, facilitated by the extensive open-source data available online and the assistance of multimodal large language models. Second, the pretrained model is distributed to the clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.
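The abstract outlines a three-stage pipeline: server-side visual-text pretraining on open web data with MLLM assistance, local training on clients, and a server-side alignment phase supervised by the MLLM. The sketch below shows how such a loop might be organized, assuming a standard FedAvg-style aggregation between stages; the `pretrain_fn`, `local_train_fn`, and `align_fn` callables are hypothetical placeholders for the paper's actual MLLM-supervised objectives, which the abstract does not specify.

```python
# Minimal sketch of the three-stage MLLM-FL loop described in the abstract.
# Assumptions: FedAvg aggregation; the MLLM-supervised objectives are passed
# in as opaque callables (pretrain_fn, local_train_fn, align_fn).

import copy
from typing import Callable, Dict, List

import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def federated_average(client_states: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Parameter-wise average of client state_dicts (plain FedAvg)."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    return avg


def mllm_fl(global_model: nn.Module,
            public_loader: DataLoader,            # open-source web data on the server
            client_loaders: List[DataLoader],     # private client datasets
            pretrain_fn: Callable,                # stage 1: MLLM-assisted visual-text pretraining step
            local_train_fn: Callable,             # stage 2: one round of local client training
            align_fn: Callable,                   # stage 3: MLLM-supervised alignment step
            rounds: int = 10) -> nn.Module:
    # Stage 1: global visual-text pretraining on server-side open data,
    # with supervision (e.g. captions) produced by an MLLM.
    for batch in public_loader:
        pretrain_fn(global_model, batch)

    for _ in range(rounds):
        # Stage 2: distribute the global model and train locally on each client.
        client_states = []
        for loader in client_loaders:
            local_model = copy.deepcopy(global_model)
            local_train_fn(local_model, loader)
            client_states.append(local_model.state_dict())

        # Aggregate the returned client models on the server (FedAvg assumed).
        global_model.load_state_dict(federated_average(client_states))

        # Stage 3: global alignment on the server under MLLM supervision.
        for batch in public_loader:
            align_fn(global_model, batch)

    return global_model
```

Note that in this sketch only model parameters leave the clients; all MLLM calls and open-data processing stay on the server, which is consistent with the abstract's claim that the approach adds no privacy risk or compute burden on local devices.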