HTDcr: a job execution framework for high-throughput computing on supercomputers

Impact factor: 7.3; Zone 2 (Computer Science); JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS); Science China Information Sciences; Pub Date: 2023-12-22; DOI: 10.1007/s11432-022-3657-3

Jiazhi Jiang, Dan Huang, Hu Chen, Yutong Lu, Xiangke Liao


Citations: 0

Abstract



High-throughput computing (HTC) is a computing paradigm that accomplishes jobs by breaking them into smaller, independent components. However, it requires a large amount of computing power over a long period. Most existing HTC frameworks are job-oriented and lack support for co-scheduling with the hardware architecture and for task-level execution. Moreover, most of these frameworks reach only a limited scale, and their usability needs further improvement. Herein, we present HTDcr, a job execution framework for HTC on supercomputers. This study aims to improve the throughput, task dispatching, and usability of the framework. Specifically, the throughput optimizations include a carefully designed task management system, a hierarchical scheduler, and the co-optimization of the task-scheduling strategy with the application and hardware characteristics. The usability optimizations include a programmable execution workflow, mechanisms for more robust and reliable quality of service, and a fine-grained resource allocation system for the colocation of multiple jobs. According to our evaluations, HTDcr achieves outstanding scalability and high throughput on large-scale clusters for HTC workloads. We evaluate HTDcr with several microbenchmarks and real-world applications on Tianhe-2 and Sunway TaihuLight to demonstrate the effects of its design mechanisms. For instance, for two real-world applications, task scheduling integrated with the application and hardware characteristics achieves 1.7× and 1.9× speedups over the basic task-scheduling strategy.
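The abstract does not give implementation details, so the following is only an illustrative sketch of the general idea it names: a job is split into independent tasks, and a two-level (hierarchical) scheduler hands them out. It is not HTDcr's API or design. The names `partition`, `run_group`, `run_job`, and `square`, the round-robin split, and the use of Python thread pools are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(tasks, n_groups):
    """Top level: deal tasks round-robin into n_groups sub-queues."""
    return [tasks[i::n_groups] for i in range(n_groups)]


def run_group(func, group_tasks, workers_per_group):
    """Second level: one group scheduler runs its sub-queue on its own workers."""
    with ThreadPoolExecutor(max_workers=workers_per_group) as pool:
        return list(pool.map(func, group_tasks))


def run_job(func, inputs, n_groups=2, workers_per_group=2):
    """Split a job into independent tasks and run them through two scheduling levels."""
    indexed = list(enumerate(inputs))  # remember each task's original position
    groups = partition(indexed, n_groups)

    def run_indexed(item):
        index, value = item
        return index, func(value)

    # Hypothetical top-level scheduler: one thread per group scheduler.
    with ThreadPoolExecutor(max_workers=n_groups) as top:
        futures = [
            top.submit(run_group, run_indexed, group, workers_per_group)
            for group in groups
        ]
        pairs = [pair for future in futures for pair in future.result()]

    # Reassemble results in the original task order.
    results = [None] * len(indexed)
    for index, value in pairs:
        results[index] = value
    return results


def square(x):
    """Stand-in for one independent task of an HTC job."""
    return x * x


if __name__ == "__main__":
    print(run_job(square, range(10)))
```

Because the tasks are independent, the top level only needs to decide which group receives which task; each group then schedules its own workers without coordinating with the others. A real framework on a supercomputer would dispatch across nodes rather than threads and would take hardware and application characteristics into account when assigning tasks, which this sketch does not attempt.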


Source journal: Science China Information Sciences (COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 12.60

Self-citation rate: 5.70%

Articles published: 224

Review time: 8.3 months

About the journal: Science China Information Sciences is a dedicated journal that showcases high-quality, original research across various domains of information sciences. It encompasses Computer Science & Technologies, Control Science & Engineering, Information & Communication Engineering, Microelectronics & Solid-State Electronics, and Quantum Information, providing a platform for the dissemination of significant contributions in these fields.