Wei Gao, Jiarui Fang, Wenlai Zhao, Jinzhe Yang, Long Wang, L. Gan, H. Fu, Guangwen Yang
{"title":"swATOP","authors":"Wei Gao, Jiarui Fang, Wenlai Zhao, Jinzhe Yang, Long Wang, L. Gan, H. Fu, Guangwen Yang","doi":"10.1145/3337821.3337883","DOIUrl":null,"url":null,"abstract":"Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization involves huge engineering efforts, due to the variety of DL operator implementations and complex programming skills. Targeting the innovative many-core processor SW26010 adopted by the 3rd fastest supercomputer Sunway TaihuLight, an end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization. Arithmetic intensive DL operators are expressed into an auto-tuning-friendly form, which is based on tensorized primitives. By describing the algorithm of a DL operator using our domain specific language (DSL), swATOP is able to derive and produce an optimal implementation by separating hardware-dependent optimization and hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives with sufficient utilization of the underlying hardware features. The hardware-agnostic optimization contains a scheduler, an intermediate representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to perform an automatic design space exploration, to apply a set of programming techniques, to discover a near-optimal solution, and to generate the executable code. Our experiments show that swATOP is able to bring significant performance improvement on DL operators in over 88% of cases, compared with the best-handcrafted optimization. Compared to a black-box autotuner, the tuning and code generation time can be reduced to minutes from days using swATOP.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"79 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337883","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization involves huge engineering efforts, due to the variety of DL operator implementations and complex programming skills. Targeting the innovative many-core processor SW26010 adopted by the 3rd fastest supercomputer Sunway TaihuLight, an end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization. Arithmetic intensive DL operators are expressed into an auto-tuning-friendly form, which is based on tensorized primitives. By describing the algorithm of a DL operator using our domain specific language (DSL), swATOP is able to derive and produce an optimal implementation by separating hardware-dependent optimization and hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives with sufficient utilization of the underlying hardware features. The hardware-agnostic optimization contains a scheduler, an intermediate representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to perform an automatic design space exploration, to apply a set of programming techniques, to discover a near-optimal solution, and to generate the executable code. Our experiments show that swATOP is able to bring significant performance improvement on DL operators in over 88% of cases, compared with the best-handcrafted optimization. Compared to a black-box autotuner, the tuning and code generation time can be reduced to minutes from days using swATOP.