Title: How to find research problems
Author: Remzi H. Arpaci-Dusseau
DOI: https://doi.org/10.1145/3568161.3568316
Published: 2022-11-07, Proceedings of the 23rd International Middleware Conference Extended Abstracts

Abstract: In this talk, I discuss how our group approaches the most basic question facing all researchers: how do we find good systems problems to work on? Through examples drawn from a research career now spanning nearly 30 years, I present different problems we have worked on and explain how we arrived at them. The examples highlight our work in file systems, storage systems, and distributed systems, including older work on reliability and more recent work on distributed systems.
Title: Hardware-middleware system co-design for flexible training of foundation models in the cloud
Author: Seetharami R. Seelam
DOI: https://doi.org/10.1145/3568161.3568317
Published: 2022-11-07, Proceedings of the 23rd International Middleware Conference Extended Abstracts

Abstract: Foundation models are a new class of AI models that are trained on broad data (typically via self-supervision) and can be applied to many different downstream tasks. Because self-supervision allows training on massive amounts of unlabeled data, these models have grown to hundreds of billions of parameters, and producing a foundation model can take many months on hundreds of GPUs. AI systems and middleware are therefore critical to training foundation models in a scalable, cost-effective manner. In this talk, I will discuss the architecture of a new cloud-based AI system for training large-scale foundation models. The system is built entirely from an open-source software stack, from the hypervisor to the guest operating systems, and from container platforms to AI frameworks and libraries. It is built natively into the IBM Cloud platform, and its hardware and software stack is optimized for training foundation models on hundreds of GPUs. We trained various foundation models with state-of-the-art accuracy in record time on this platform. I will discuss the architecture, our operational experience, and thoughts on directions for the co-design of hardware and middleware for future AI systems.
{"title":"Proceedings of the 23rd International Middleware Conference Extended Abstracts","authors":"","doi":"10.1145/3568161","DOIUrl":"https://doi.org/10.1145/3568161","url":null,"abstract":"","PeriodicalId":436911,"journal":{"name":"Proceedings of the 23rd International Middleware Conference Extended Abstracts","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125186558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}