Hardware-middleware system co-design for flexible training of foundation models in the cloud

Seetharami R. Seelam
Proceedings of the 23rd International Middleware Conference Extended Abstracts
Published: 2022-11-07
DOI: 10.1145/3568161.3568317 (https://doi.org/10.1145/3568161.3568317)
Citation count: 1

Abstract

Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and can be adapted to many different downstream tasks. Because self-supervision makes it possible to train on massive amounts of unlabeled data, these models have grown to hundreds of billions of parameters, and training a single foundation model can take many months on hundreds of GPUs. AI systems and middleware are therefore critical to training these models in a scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI system for training large-scale foundation models. The system is built entirely on an open-source software stack, from the hypervisor to the guest operating systems and from the container platform to the AI frameworks and libraries. It is built natively into the IBM Cloud platform, and its hardware and software stack is optimized for training foundation models on hundreds of GPUs. We trained various foundation models with state-of-the-art accuracy in the shortest time on this platform. I will discuss the architecture, our operational experience, and thoughts on directions for the co-design of hardware and middleware for future AI systems.