Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Frontiers in Big Data Pub Date : 2025-01-07 eCollection Date: 2024-01-01 DOI:10.3389/fdata.2024.1476506
Revathy Venkataramanan, Aalap Tripathy, Tarun Kumar, Sergey Serebryakov, Annmary Justine, Arpit Shah, Suparna Bhattacharya, Martin Foltin, Paolo Faraboschi, Kaushik Roy, Amit Sheth
{"title":"Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization.","authors":"Revathy Venkataramanan, Aalap Tripathy, Tarun Kumar, Sergey Serebryakov, Annmary Justine, Arpit Shah, Suparna Bhattacharya, Martin Foltin, Paolo Faraboschi, Kaushik Roy, Amit Sheth","doi":"10.3389/fdata.2024.1476506","DOIUrl":null,"url":null,"abstract":"<p><p>The emergence of advanced artificial intelligence (AI) models has driven the development of frameworks and approaches that focus on automating model training and hyperparameter tuning of end-to-end AI pipelines. However, other crucial stages of these pipelines such as dataset selection, feature engineering, and model optimization for deployment have received less attention. Improving efficiency of end-to-end AI pipelines requires metadata of past executions of AI pipelines and all their stages. Regenerating metadata history by re-executing existing AI pipelines is computationally challenging and impractical. To address this issue, we propose to source AI pipeline metadata from open-source platforms such as Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats from these diverse sources is a challenge. In this study, we present a solution by introducing Common Metadata Ontology (CMO) which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations on AIMKG to search and recommend relevant pipelines to user query. For quantitative evaluation, we propose a custom aggregation model that outperforms other baselines by achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that AIMKG-based recommender retrieved relevant pipelines in 78% of test cases compared to the state-of-the-art MLSchema-based recommender which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1476506"},"PeriodicalIF":2.4000,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11748301/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdata.2024.1476506","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

The emergence of advanced artificial intelligence (AI) models has driven the development of frameworks and approaches that focus on automating model training and hyperparameter tuning of end-to-end AI pipelines. However, other crucial stages of these pipelines such as dataset selection, feature engineering, and model optimization for deployment have received less attention. Improving efficiency of end-to-end AI pipelines requires metadata of past executions of AI pipelines and all their stages. Regenerating metadata history by re-executing existing AI pipelines is computationally challenging and impractical. To address this issue, we propose to source AI pipeline metadata from open-source platforms such as Papers-with-Code, OpenML, and Hugging Face. However, integrating and unifying the varying terminologies and data formats from these diverse sources is a challenge. In this study, we present a solution by introducing Common Metadata Ontology (CMO) which is used to construct an extensive AI Pipeline Metadata Knowledge Graph (AIMKG) consisting of 1.6 million pipelines. Through semantic enhancements, the pipeline metadata in AIMKG is also enriched for downstream tasks such as search and recommendation of AI pipelines. We perform quantitative and qualitative evaluations on AIMKG to search and recommend relevant pipelines to user query. For quantitative evaluation, we propose a custom aggregation model that outperforms other baselines by achieving a retrieval accuracy (R@1) of 76.3%. Our qualitative analysis shows that AIMKG-based recommender retrieved relevant pipelines in 78% of test cases compared to the state-of-the-art MLSchema-based recommender which retrieved relevant responses in 51% of the cases. AIMKG serves as an atlas for navigating the evolving AI landscape, providing practitioners with a comprehensive factsheet for their applications. It guides AI pipeline optimization, offers insights and recommendations for improving AI pipelines, and serves as a foundation for data mining and analysis of evolving AI workflows.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
构建元数据知识图作为人工智能管道优化的图谱。
先进人工智能(AI)模型的出现推动了框架和方法的发展,这些框架和方法专注于端到端人工智能管道的自动化模型训练和超参数调优。然而,这些管道的其他关键阶段,如数据集选择、特征工程和部署模型优化,却很少受到关注。提高端到端人工智能管道的效率需要人工智能管道过去执行及其所有阶段的元数据。通过重新执行现有的人工智能管道来重新生成元数据历史在计算上具有挑战性且不切实际。为了解决这个问题,我们建议从开源平台(如Papers-with-Code、OpenML和hugs Face)中获取人工智能管道元数据。然而,集成和统一来自这些不同来源的不同术语和数据格式是一个挑战。在本研究中,我们通过引入公共元数据本体(CMO)提出了一种解决方案,该本体用于构建包含160万条管道的广泛AI管道元数据知识图(AIMKG)。通过语义增强,AIMKG中的管道元数据也得到了丰富,可用于AI管道的搜索和推荐等下游任务。我们对AIMKG进行定量和定性评估,搜索并推荐相关管道供用户查询。为了进行定量评估,我们提出了一个自定义聚合模型,该模型通过实现76.3%的检索准确率(R@1)优于其他基线。我们的定性分析表明,基于aimkg的推荐器在78%的测试用例中检索到相关的管道,而基于最先进的基于mlschema的推荐器在51%的用例中检索到相关的响应。AIMKG作为导航不断发展的人工智能景观的地图集,为从业者提供其应用程序的全面概况。它指导人工智能管道优化,为改进人工智能管道提供见解和建议,并作为数据挖掘和分析不断发展的人工智能工作流的基础。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
CiteScore
5.20
自引率
3.20%
发文量
122
审稿时长
13 weeks
期刊最新文献
Prediction model of middle school student performance based on MBSO and MDBO-BP-Adaboost method. Multi-source data recognition and fusion algorithm based on a two-layer genetic algorithm-back propagation model. Constructing a metadata knowledge graph as an atlas for demystifying AI pipeline optimization. Editorial: Cybersecurity and artificial intelligence: advances, challenges, opportunities, threats. Exploring infodemiology: unraveling the intricate relationships among stress, headaches, migraines, and suicide through Google Trends analysis.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1