Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Frontiers in Big Data Pub Date : 2025-01-07 eCollection Date: 2024-01-01 DOI:10.3389/fdata.2024.1476506

Revathy Venkataramanan, Aalap Tripathy, Tarun Kumar, Sergey Serebryakov, Annmary Justine, Arpit Shah, Suparna Bhattacharya, Martin Foltin, Paolo Faraboschi, Kaushik Roy, Amit Sheth

{"title":"Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization.","authors":"Revathy Venkataramanan, Aalap Tripathy, Tarun Kumar, Sergey Serebryakov, Annmary Justine, Arpit Shah, Suparna Bhattacharya, Martin Foltin, Paolo Faraboschi, Kaushik Roy, Amit Sheth","doi":"10.3389/fdata.2024.1476506","DOIUrl":null,"url":null,"abstract":"<p><p>The emergence of advanced artificial intelligence (AI) models has driven the development of frameworks and approaches that focus on automating model training and hyperparameter tuning of end-to-end AI pipelines. However, other crucial stages of these pipelines such as dataset selection, feature engineering, and model optimization for deployment have received less attention. Improving efficiency of end-to-end AI pipelines requires metadata of past executions of AI pipelines and all their stages. Regenerating metadata history by re-executing existing AI pipelines is computationally challenging and impractical. To address this issue, we propose to source AI pipeline metadata from open-source platforms such as Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats from these diverse sources is a challenge. In this study, we present a solution by introducing Common Metadata Ontology (CMO) which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations on AIMKG to search and recommend relevant pipelines to user query. For quantitative evaluation, we propose a custom aggregation model that outperforms other baselines by achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that AIMKG-based recommender retrieved relevant pipelines in 78% of test cases compared to the state-of-the-art MLSchema-based recommender which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1476506"},"PeriodicalIF":2.4000,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11748301/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdata.2024.1476506","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

The emergence of advanced artificial intelligence (AI) models has driven the development of frameworks and approaches that focus on automating model training and hyperparameter tuning of end-to-end AI pipelines. However, other crucial stages of these pipelines such as dataset selection, feature engineering, and model optimization for deployment have received less attention. Improving efficiency of end-to-end AI pipelines requires metadata of past executions of AI pipelines and all their stages. Regenerating metadata history by re-executing existing AI pipelines is computationally challenging and impractical. To address this issue, we propose to source AI pipeline metadata from open-source platforms such as Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats from these diverse sources is a challenge. In this study, we present a solution by introducing Common Metadata Ontology (CMO) which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations on AIMKG to search and recommend relevant pipelines to user query. For quantitative evaluation, we propose a custom aggregation model that outperforms other baselines by achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that AIMKG-based recommender retrieved relevant pipelines in 78% of test cases compared to the state-of-the-art MLSchema-based recommender which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

构建元数据知识图作为人工智能管道优化的图谱。

先进人工智能（AI）模型的出现推动了框架和方法的发展，这些框架和方法专注于端到端人工智能管道的自动化模型训练和超参数调优。然而，这些管道的其他关键阶段，如数据集选择、特征工程和部署模型优化，却很少受到关注。提高端到端人工智能管道的效率需要人工智能管道过去执行及其所有阶段的元数据。通过重新执行现有的人工智能管道来重新生成元数据历史在计算上具有挑战性且不切实际。为了解决这个问题，我们建议从开源平台（如Papers-with-Code、OpenML和hugs Face）中获取人工智能管道元数据。然而，集成和统一来自这些不同来源的不同术语和数据格式是一个挑战。在本研究中，我们通过引入公共元数据本体（CMO）提出了一种解决方案，该本体用于构建包含160万条管道的广泛AI管道元数据知识图（AIMKG）。通过语义增强，AIMKG中的管道元数据也得到了丰富，可用于AI管道的搜索和推荐等下游任务。我们对AIMKG进行定量和定性评估，搜索并推荐相关管道供用户查询。为了进行定量评估，我们提出了一个自定义聚合模型，该模型通过实现76.3%的检索准确率（R@1）优于其他基线。我们的定性分析表明，基于aimkg的推荐器在78%的测试用例中检索到相关的管道，而基于最先进的基于mlschema的推荐器在51%的用例中检索到相关的响应。AIMKG作为导航不断发展的人工智能景观的地图集，为从业者提供其应用程序的全面概况。它指导人工智能管道优化，为改进人工智能管道提供见解和建议，并作为数据挖掘和分析不断发展的人工智能工作流的基础。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊