FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?

IF 4.5 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Journal of Artificial Intelligence Research Pub Date : 2023-05-06 DOI:10.1613/jair.1.13942

Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha

{"title":"FlexiBERT: Are Current Transformer Architectures too Homogeneous and Rigid?","authors":"Shikhar Tuli, Bhishma Dedhia, Shreshth Tuli, Niraj K. Jha","doi":"10.1613/jair.1.13942","DOIUrl":null,"url":null,"abstract":"The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. However, training such models and exploring their hyperparameter space is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, such works limit analysis to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves 8.9% higher GLUE score. A FlexiBERT model with equivalent performance as the best homogeneous model has 2.6× smaller size. FlexiBERT-Large, another proposed model, attains state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.","PeriodicalId":54877,"journal":{"name":"Journal of Artificial Intelligence Research","volume":"90 1","pages":"0"},"PeriodicalIF":4.5000,"publicationDate":"2023-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Artificial Intelligence Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1613/jair.1.13942","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 3

Abstract

The existence of a plethora of language models makes the problem of selecting the best one for a custom task challenging. Most state-of-the-art methods leverage transformer-based models (e.g., BERT) or their variants. However, training such models and exploring their hyperparameter space is computationally expensive. Prior work proposes several neural architecture search (NAS) methods that employ performance predictors (e.g., surrogate models) to address this issue; however, such works limit analysis to homogeneous models that use fixed dimensionality throughout the network. This leads to sub-optimal architectures. To address this limitation, we propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions. For better-posed surrogate modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme. We also propose a novel NAS policy, called BOSHNAS, that leverages this new scheme, Bayesian modeling, and second-order optimization, to quickly train and use a neural surrogate model to converge to the optimal architecture. A comprehensive set of experiments shows that the proposed policy, when applied to the FlexiBERT design space, pushes the performance frontier upwards compared to traditional models. FlexiBERT-Mini, one of our proposed models, has 3% fewer parameters than BERT-Mini and achieves 8.9% higher GLUE score. A FlexiBERT model with equivalent performance as the best homogeneous model has 2.6× smaller size. FlexiBERT-Large, another proposed model, attains state-of-the-art results, outperforming the baseline models by at least 5.7% on the GLUE benchmark.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

flexbert:电流互感器架构是否过于单一和僵化?

过多的语言模型的存在使得为自定义任务选择最佳模型的问题具有挑战性。大多数最先进的方法利用基于变压器的模型(例如BERT)或它们的变体。然而，训练这样的模型和探索它们的超参数空间在计算上是昂贵的。先前的工作提出了几种使用性能预测因子(例如代理模型)的神经结构搜索(NAS)方法来解决这个问题;然而，这些工作将分析限制在整个网络中使用固定维数的同质模型。这将导致次优架构。为了解决这个限制，我们提出了一套异构和灵活的模型，即FlexiBERT，它具有不同的编码器层，具有不同的可能操作集和不同的隐藏维度。为了在这个扩展的设计空间中更好地定位代理建模，我们提出了一种新的基于图相似度的嵌入方案。我们还提出了一种新的NAS策略，称为BOSHNAS，它利用这种新方案、贝叶斯建模和二阶优化来快速训练并使用神经代理模型收敛到最优架构。一组全面的实验表明，当将所提出的策略应用于FlexiBERT设计空间时，与传统模型相比，该策略将性能边界推向了更高的水平。flexbert - mini是我们提出的模型之一，其参数比BERT-Mini少3%，GLUE得分高出8.9%。具有同等性能的最佳同质模型的FlexiBERT模型尺寸缩小2.6倍。flexbert - large是另一种被提议的模型，它获得了最先进的结果，在GLUE基准测试中比基准模型至少高出5.7%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Artificial Intelligence Research 工程技术-计算机：人工智能

CiteScore

9.60

自引率

4.00%

发文量

审稿时长

4 months

期刊介绍： JAIR(ISSN 1076 - 9757) covers all areas of artificial intelligence (AI), publishing refereed research articles, survey articles, and technical notes. Established in 1993 as one of the first electronic scientific journals, JAIR is indexed by INSPEC, Science Citation Index, and MathSciNet. JAIR reviews papers within approximately three months of submission and publishes accepted articles on the internet immediately upon receiving the final versions. JAIR articles are published for free distribution on the internet by the AI Access Foundation, and for purchase in bound volumes by AAAI Press.

期刊最新文献

Collective Belief Revision Competitive Equilibria with a Constant Number of Chores Improving Resource Allocations by Sharing in Pairs A General Model for Aggregating Annotations Across Simple, Complex, and Multi-Object Annotation Tasks Asymptotics of K-Fold Cross Validation