{"title":"Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks.","authors":"Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai","doi":"","DOIUrl":null,"url":null,"abstract":"<p><p>An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a <i>single</i> task or (ii) they are <i>linear</i>, very little is known about the closer-to-practice case of <i>nonlinear</i> NNs trained on <i>multiple</i> tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an <math><mi>r</mi></math> -dimensional subspace within the <math><mi>d</mi> <mo>≫</mo> <mi>r</mi></math> -dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of <math><mi>d</mi></math> . In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all <math><mi>r</mi></math> ground-truth features.</p>","PeriodicalId":74504,"journal":{"name":"Proceedings of machine learning research","volume":"235 ","pages":"9292-9345"},"PeriodicalIF":0.0000,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11486479/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of machine learning research","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a single task or (ii) they are linear, very little is known about the closer-to-practice case of nonlinear NNs trained on multiple tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an r-dimensional subspace within the d ≫ r-dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of d. In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all r ground-truth features.
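Below is a minimal sketch of the setup the abstract describes, not the paper's actual algorithm, hyperparameters, or guarantees: binary tasks whose labels depend on the data only through its projection onto a hidden r-dimensional subspace of the d-dimensional input space, multi-task pretraining of a two-layer ReLU network with a shared hidden layer and per-task linear heads, and downstream adaptation by retraining only a new linear head on the frozen representation. All names and values (d, r, width, make_task, the soft-margin loss, etc.) are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d, r, width, n_tasks, n_per_task = 50, 3, 64, 20, 200

# Hidden ground-truth subspace B in R^{d x r}; each task labels a point by the
# sign of a random linear function of its projection B^T x.
B, _ = torch.linalg.qr(torch.randn(d, r))

def make_task(n):
    X = torch.randn(n, d)
    w = torch.randn(r)                      # task-specific direction inside the subspace
    y = torch.sign(X @ B @ w)               # labels depend on x only through B^T x
    return X, y

# Two-layer ReLU network: shared first-layer weights W, one linear head per task.
W = torch.randn(width, d) / d ** 0.5
W.requires_grad_(True)
heads = [torch.randn(width, requires_grad=True) for _ in range(n_tasks)]
tasks = [make_task(n_per_task) for _ in range(n_tasks)]

# Multi-task pretraining: average the per-task losses and take gradient steps
# on the shared representation and all heads jointly.
opt = torch.optim.SGD([W, *heads], lr=0.05)
for step in range(500):
    opt.zero_grad()
    loss = 0.0
    for (X, y), a in zip(tasks, heads):
        logits = torch.relu(X @ W.T) @ a    # f_t(x) = a_t^T ReLU(W x)
        loss = loss + torch.nn.functional.soft_margin_loss(logits, y)
    (loss / n_tasks).backward()
    opt.step()

# Downstream adaptation: freeze the pretrained representation and fit only a
# fresh linear head on a new task drawn from the same subspace model.
X_new, y_new = make_task(100)
feat = torch.relu(X_new @ W.detach().T)
head_new = torch.zeros(width, requires_grad=True)
opt2 = torch.optim.SGD([head_new], lr=0.1)
for step in range(300):
    opt2.zero_grad()
    torch.nn.functional.soft_margin_loss(feat @ head_new, y_new).backward()
    opt2.step()
```

Since every task's label is a function of B^T x alone, the shared hidden layer only needs to capture that r-dimensional projection; this is why, per the abstract, the downstream sample and neuron complexity can be independent of the ambient dimension d once the representation is learned.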