SAIH: A Scalable Evaluation Methodology for Understanding AI Performance Trend on HPC Systems

IF 1.3 | Region 3 (Computer Science) | JCR Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Journal of Computer Science and Technology | Pub Date: 2024-06-06 | DOI: 10.1007/s11390-023-1840-y

Jiang-Su Du, Dong-Sheng Li, Ying-Peng Wen, Jia-Zhi Jiang, Dan Huang, Xiang-Ke Liao, Yu-Tong Lu

Citations: 0

Abstract

Novel artificial intelligence (AI) technology has accelerated research in many scientific fields, e.g., cosmology, physics, and bioinformatics, and has inevitably become a significant category of workload on high-performance computing (HPC) systems. Existing AI benchmarks tend to customize well-recognized AI applications in order to evaluate the AI performance of HPC systems under a predefined problem size, in terms of datasets and AI models. However, driven by novel AI technology, most AI applications are evolving quickly in both models and datasets to achieve higher accuracy and to apply to more scenarios. Because they cannot scale the problem size, static AI benchmarks may be inadequate for understanding the performance trend of evolving AI applications on HPC systems, in particular scientific AI applications on large-scale systems. In this paper, we propose a scalable evaluation methodology (SAIH) for analyzing the AI performance trend of HPC systems by scaling the problem sizes of customized AI applications. To enable scalability, SAIH builds a set of novel mechanisms for augmenting problem sizes. As the data and model scale continuously, we can investigate the trend and range of AI performance on HPC systems and further diagnose system bottlenecks. To verify our methodology, we augment a cosmological AI application to evaluate a real HPC system equipped with GPUs as a case study of SAIH. With data and model augmentation, SAIH can progressively evaluate the AI performance trend of HPC systems, e.g., with measured performance increasing from 5.2% to 59.6% of the peak theoretical hardware performance. The evaluation results are analyzed and summarized into insights on performance issues. For instance, we find that the AI application constantly consumes the I/O bandwidth of the shared parallel file system while iteratively training the model. If I/O contention exists, the shared parallel file system might become a bottleneck.
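The abstract reports AI performance as a percentage of peak theoretical hardware performance, tracked as the problem size grows. A minimal sketch of that bookkeeping follows. It is not the SAIH implementation: the function names, the scale factors, and every number in it are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def fraction_of_peak(achieved_flops: float, peak_flops: float) -> float:
    """Return achieved throughput as a percentage of theoretical peak."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return 100.0 * achieved_flops / peak_flops


def performance_trend(measurements, peak_flops):
    """Map (problem_scale, achieved_flops) pairs to (problem_scale, % of peak),
    ordered by increasing problem scale."""
    return [
        (scale, fraction_of_peak(flops, peak_flops))
        for scale, flops in sorted(measurements)
    ]


# Hypothetical values for illustration only; NOT results from the paper.
PEAK_FLOPS = 100.0e12  # assumed theoretical peak of the system under test
runs = [(1, 5.0e12), (4, 30.0e12), (2, 12.0e12)]  # (scale factor, achieved FLOP/s)

trend = performance_trend(runs, PEAK_FLOPS)
for scale, pct in trend:
    print(f"problem scale x{scale}: {pct:.1f}% of peak")
```

Sorting by problem scale before computing the ratio keeps the output in the order a trend plot would use, so a flattening or a drop in the percentage at larger scales is easy to spot as a candidate bottleneck.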

Source journal

Journal of Computer Science and Technology (Engineering & Technology - Computer Science: Software Engineering)

CiteScore

4.00

Self-citation rate

0.00%

Publications

2255

Review time

9.8 months

About the journal: Journal of Computer Science and Technology (JCST), the first English language journal in the computer field published in China, is an international forum for scientists and engineers involved in all aspects of computer science and technology to publish high-quality, refereed papers. Papers reporting original research and innovative applications from all parts of the world are welcome. Papers for publication in the journal are selected through rigorous peer review, to ensure originality, timeliness, relevance, and readability. While the journal emphasizes the publication of previously unpublished materials, selected conference papers with exceptional merit that require wider exposure are, at the discretion of the editors, also published, provided they meet the journal's peer review standards. The journal also seeks clearly written survey and review articles from experts in the field, to promote insightful understanding of the state of the art and technology trends. Topics covered by Journal of Computer Science and Technology include but are not limited to: Computer Architecture and Systems; Artificial Intelligence and Pattern Recognition; Computer Networks and Distributed Computing; Computer Graphics and Multimedia; Software Systems; Data Management and Data Mining; Theory and Algorithms; Emerging Areas.