Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity

Dongyue Li, Aneesh Sharma, Hongyang R. Zhang
{"title":"利用基于梯度的任务亲和性估计进行可扩展的多任务学习","authors":"Dongyue Li, Aneesh Sharma, Hongyang R. Zhang","doi":"arxiv-2409.06091","DOIUrl":null,"url":null,"abstract":"Multitask learning is a widely used paradigm for training models on diverse\ntasks, with applications ranging from graph neural networks to language model\nfine-tuning. Since tasks may interfere with each other, a key notion for\nmodeling their relationships is task affinity. This includes pairwise task\naffinity, computed among pairs of tasks, and higher-order affinity, computed\namong subsets of tasks. Naively computing either of them requires repeatedly\ntraining on data from various task combinations, which is computationally\nintensive. We present a new algorithm Grad-TAG that can estimate task\naffinities without this repeated training. The key idea of Grad-TAG is to train a \"base\" model for all tasks and then\nuse a linearization technique to estimate the loss of the model for a specific\ntask combination. The linearization works by computing a gradient-based\napproximation of the loss, using low-dimensional projections of gradients as\nfeatures in a logistic regression to predict labels for the task combination.\nWe show that the linearized model can provably approximate the loss when the\ngradient-based approximation is accurate, and also empirically verify that on\nseveral large models. Then, given the estimated task affinity, we design a\nsemi-definite program for clustering similar tasks by maximizing the average\ndensity of clusters. We evaluate Grad-TAG's performance across seven datasets, including\nmulti-label classification on graphs, and instruction fine-tuning of language\nmodels. Our task affinity estimates are within 2.7% distance to the true\naffinities while needing only 3% of FLOPs in full training. On our largest\ngraph with 21M edges and 500 labeling tasks, our algorithm delivers estimates\nwithin 5% distance to the true affinities, using only 112 GPU hours. Our\nresults show that Grad-TAG achieves excellent performance and runtime tradeoffs\ncompared to existing approaches.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Scalable Multitask Learning Using Gradient-based Estimation of Task Affinity\",\"authors\":\"Dongyue Li, Aneesh Sharma, Hongyang R. Zhang\",\"doi\":\"arxiv-2409.06091\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multitask learning is a widely used paradigm for training models on diverse\\ntasks, with applications ranging from graph neural networks to language model\\nfine-tuning. Since tasks may interfere with each other, a key notion for\\nmodeling their relationships is task affinity. This includes pairwise task\\naffinity, computed among pairs of tasks, and higher-order affinity, computed\\namong subsets of tasks. Naively computing either of them requires repeatedly\\ntraining on data from various task combinations, which is computationally\\nintensive. We present a new algorithm Grad-TAG that can estimate task\\naffinities without this repeated training. The key idea of Grad-TAG is to train a \\\"base\\\" model for all tasks and then\\nuse a linearization technique to estimate the loss of the model for a specific\\ntask combination. 
The linearization works by computing a gradient-based\\napproximation of the loss, using low-dimensional projections of gradients as\\nfeatures in a logistic regression to predict labels for the task combination.\\nWe show that the linearized model can provably approximate the loss when the\\ngradient-based approximation is accurate, and also empirically verify that on\\nseveral large models. Then, given the estimated task affinity, we design a\\nsemi-definite program for clustering similar tasks by maximizing the average\\ndensity of clusters. We evaluate Grad-TAG's performance across seven datasets, including\\nmulti-label classification on graphs, and instruction fine-tuning of language\\nmodels. Our task affinity estimates are within 2.7% distance to the true\\naffinities while needing only 3% of FLOPs in full training. On our largest\\ngraph with 21M edges and 500 labeling tasks, our algorithm delivers estimates\\nwithin 5% distance to the true affinities, using only 112 GPU hours. Our\\nresults show that Grad-TAG achieves excellent performance and runtime tradeoffs\\ncompared to existing approaches.\",\"PeriodicalId\":501340,\"journal\":{\"name\":\"arXiv - STAT - Machine Learning\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - STAT - Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.06091\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Multitask learning is a widely used paradigm for training models on diverse tasks, with applications ranging from graph neural networks to language model fine-tuning. Since tasks may interfere with each other, a key notion for modeling their relationships is task affinity. This includes pairwise task affinity, computed among pairs of tasks, and higher-order affinity, computed among subsets of tasks. Naively computing either of them requires repeatedly training on data from various task combinations, which is computationally intensive. We present a new algorithm Grad-TAG that can estimate task affinities without this repeated training. The key idea of Grad-TAG is to train a "base" model for all tasks and then use a linearization technique to estimate the loss of the model for a specific task combination. The linearization works by computing a gradient-based approximation of the loss, using low-dimensional projections of gradients as features in a logistic regression to predict labels for the task combination. We show that the linearized model can provably approximate the loss when the gradient-based approximation is accurate, and also empirically verify that on several large models. Then, given the estimated task affinity, we design a semi-definite program for clustering similar tasks by maximizing the average density of clusters. We evaluate Grad-TAG's performance across seven datasets, including multi-label classification on graphs, and instruction fine-tuning of language models. Our task affinity estimates are within 2.7% distance to the true affinities while needing only 3% of FLOPs in full training. On our largest graph with 21M edges and 500 labeling tasks, our algorithm delivers estimates within 5% distance to the true affinities, using only 112 GPU hours. Our results show that Grad-TAG achieves excellent performance and runtime tradeoffs compared to existing approaches.
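The gradient-based linearization step can be illustrated with a small sketch. The snippet below is a minimal illustration (not the authors' implementation) of the core idea described in the abstract: pool low-dimensional random projections of per-example gradients from a candidate task subset, fit a logistic regression on them, and read off its loss on a target task as a cheap proxy for that task's loss under the combination. The function names, the random-projection setup, and the toy data are assumptions for illustration; the paper's exact projection and affinity definitions may differ.

```python
# Minimal sketch of gradient-based loss estimation for a task combination.
# Assumes per-example gradients of a shared "base" model are already computed
# and stored as rows of one matrix per task; all names here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

def make_projection(grad_dim, proj_dim=64):
    """Shared random projection used to compress every task's gradients."""
    return rng.standard_normal((grad_dim, proj_dim)) / np.sqrt(proj_dim)

def estimate_loss_on_task(grads, labels, subset, target, proj):
    """Fit a logistic regression on projected gradients pooled over `subset`
    and return its log-loss on `target` -- a proxy for the loss a model
    trained on that task combination would incur on the target task."""
    X = np.vstack([grads[t] @ proj for t in subset])
    y = np.concatenate([labels[t] for t in subset])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    X_t, y_t = grads[target] @ proj, labels[target]
    return log_loss(y_t, clf.predict_proba(X_t), labels=[0, 1])

# Toy data: 3 binary tasks, 200 examples each, gradients of dimension 1000.
grads = {t: rng.standard_normal((200, 1000)) for t in range(3)}
labels = {t: rng.integers(0, 2, size=200) for t in range(3)}
proj = make_projection(grad_dim=1000)

# Estimated loss on task 0 when it is trained together with task 1.
print(estimate_loss_on_task(grads, labels, subset=(0, 1), target=0))
```

Repeating this estimate over task pairs (or larger subsets) yields the affinity matrix; per the abstract, these estimates are then fed into a semi-definite program that groups similar tasks by maximizing the average density of the resulting clusters.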