Task-aware Scheduling and Performance Optimization on Yitian710 SoC for GEMM-based Workloads on the Cloud
Authors: Guosheng Yu, Zhihong Lv, Haijiang Wang, Zilong Huang, Jicheng Chen
Published in: 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2023-06-11
DOI: 10.1109/AICAS57966.2023.10168586 (https://doi.org/10.1109/AICAS57966.2023.10168586)
The YiTian710 SoC is a server processor based on the ARM Neoverse N2 architecture, developed by T-HEAD Semiconductor Co., Ltd. to accelerate compute-intensive tasks in Alicloud, where ML-related workloads play an important role in many applications. General Matrix Multiplication (GEMM) is the fundamental and most important compute kernel in ML workloads. Typically, a whole GEMM workload is partitioned into a series of blocks, and the resulting sub-tasks are carefully arranged to exploit the parallel hardware. That approach does not carry over directly to cloud workloads, which process multiple tasks concurrently and expect guaranteed QoS for commercial reasons. We introduce a task-aware parallel scheduling method that processes ML workloads while balancing the response delay and the throughput of a YiTian710 ECS instance. We further design a multi-thread scheduling algorithm with two-level division of the GEMM sub-tasks to achieve high efficiency, and we develop optimized GEMM kernels to maximize performance. We evaluate the method on YiTian710-based Alicloud ECS instances across different applications. The results show that our method achieves a remarkable performance improvement across these applications.
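The abstract does not spell out how the GEMM workload is blocked or what the two levels of division are. As a rough illustration only, the sketch below shows one plausible reading: a first level that assigns each thread a contiguous band of output rows, and a second level that cuts each band into small tiles. The names `split_range`, `gemm_tile`, `blocked_gemm`, and the parameters `num_threads` and `tile` are hypothetical and are not taken from the paper; the authors' scheduler and kernels may differ substantially.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, parts):
    """Split range(n) into at most `parts` contiguous, near-equal (start, stop) chunks."""
    parts = max(1, min(parts, n))
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def gemm_tile(A, B, C, r0, r1, c0, c1):
    """Compute the output tile C[r0:r1][c0:c1] of A @ B (matrices are lists of lists)."""
    K = len(B)
    for i in range(r0, r1):
        for j in range(c0, c1):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))


def blocked_gemm(A, B, num_threads=4, tile=2):
    """Hypothetical two-level partition: row bands per thread, then tiles per band."""
    M, N = len(A), len(B[0])
    C = [[0] * N for _ in range(M)]

    def worker(row_span):
        r0, r1 = row_span
        # Level 2 (assumed): cut this thread's row band into tile x tile blocks.
        for t0 in range(r0, r1, tile):
            for c0 in range(0, N, tile):
                gemm_tile(A, B, C, t0, min(t0 + tile, r1), c0, min(c0 + tile, N))

    # Level 1 (assumed): one contiguous row band per thread.
    # Bands are disjoint, so threads never write the same element of C.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, split_range(M, num_threads)))
    return C
```

In CPython the global interpreter lock prevents these threads from running the arithmetic in parallel, so the sketch shows only how the work is divided, not the speedup a native kernel would get. A task-aware scheduler in the sense of the abstract would presumably also decide how many threads each concurrent task receives; that choice is represented here only by the `num_threads` argument.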