Employing Artificial Intelligence to Steer Exascale Workflows with Colmena
Logan Ward, J. Gregory Pauloski, Valerie Hayot-Sasson, Yadu Babuji, Alexander Brace, Ryan Chard, Kyle Chard, Rajeev Thakur, Ian Foster
arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-08-26. DOI: arxiv-2408.14434 (https://doi.org/arxiv-2408.14434)
Abstract
Computational workflows are a common class of application on supercomputers,
yet the loosely coupled and heterogeneous nature of workflows often prevents them from
taking full advantage of those systems' capabilities. We created Colmena to leverage the
massive parallelism of a supercomputer by using Artificial Intelligence (AI) to
learn from and adapt a workflow as it executes. Colmena allows scientists to
define how their application should respond to events (e.g., task completion)
as a series of cooperative agents. In this paper, we describe the design of
Colmena, the challenges we overcame while deploying applications on exascale
systems, and the science workflows we have enhanced through interweaving AI.
The scaling challenges we discuss include developing steering strategies that
maximize node utilization, introducing data fabrics that reduce communication
overhead of data-intensive tasks, and implementing workflow tasks that cache
costly operations between invocations. These innovations, coupled with a variety
of application patterns accessible through our agent-based steering model, have
enabled scientific advances in chemistry, biophysics, and materials science using
different types of AI. Our vision is that Colmena will spur creative solutions
that harness AI across many domains of scientific computing.
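
The steering model summarized above (cooperative agents that respond to task-completion events, keep workers busy, and reuse costly setup between invocations) can be illustrated with a minimal sketch. This is not Colmena's API: it uses only the Python standard library, and every name in it (`load_model`, `simulate`, `run_campaign`, the `starter` and `steerer` agents) is hypothetical. The re-ranking step is a simple stand-in for an AI model, assumed here only to show where learning from results would plug in.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def load_model(name):
    """Stand-in for a costly setup step (e.g., loading model weights).

    Caching keeps the result in memory between task invocations.
    """
    return {"name": name, "scale": 2.0}


def simulate(x):
    """Hypothetical workflow task: score one candidate with the cached model."""
    return load_model("surrogate")["scale"] * x


def run_campaign(candidates, n_workers=2, budget=4):
    """Run up to `budget` tasks, re-ranking the remaining ones as results arrive."""
    budget = min(budget, len(candidates))
    pending = list(candidates)   # candidates not yet submitted
    events = queue.Queue()       # task-completion events
    lock = threading.Lock()      # guards `pending` and `state`
    state = {"submitted": 0}
    completed = []               # (input, score) pairs, in completion order

    with ThreadPoolExecutor(max_workers=n_workers) as pool:

        def submit_next():
            # Submit the highest-priority candidate if budget remains.
            with lock:
                if state["submitted"] >= budget:
                    return
                state["submitted"] += 1
                x = pending.pop()  # last item has the highest priority
            future = pool.submit(simulate, x)
            future.add_done_callback(lambda f: events.put((x, f.result())))

        def starter():
            # Agent 1: fill every worker once at startup.
            for _ in range(n_workers):
                submit_next()

        def steerer():
            # Agent 2: on each completion event, update priorities and
            # immediately backfill the worker that just became idle.
            while len(completed) < budget:
                completed.append(events.get())
                best_x = max(completed, key=lambda r: r[1])[0]
                with lock:
                    # Stand-in for an AI model: prefer inputs near the best so far.
                    pending.sort(key=lambda c: -abs(c - best_x))
                submit_next()

        agents = [threading.Thread(target=starter), threading.Thread(target=steerer)]
        for agent in agents:
            agent.start()
        for agent in agents:
            agent.join()

    return completed
```

Calling `run_campaign([1.0, 2.0, 3.0, 4.0, 5.0], n_workers=2, budget=4)` returns four `(input, score)` pairs. Which candidates are chosen can vary from run to run, because the steering agent reacts to whichever task finishes first. Submitting a replacement task from the completion handler, rather than waiting for a whole batch to finish, is what keeps workers from sitting idle in this sketch.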