Title: Optimize DGL Operations on x86-64 Multi-Core Processors
Authors: Chao Liu, Huayou Su, Y. Dou, Qinglin Wang
DOI: 10.1145/3546000.3546018
Published: 2022-06-23, Proceedings of the 6th International Conference on High Performance Compilation, Computing and Communications
Optimize DGL Operations on x86-64 Multi-Core Processors
Modern x86-64 processors achieve strong performance thanks to their long vector units, which are widely used to accelerate the inference of CNN-style neural network models. However, GNN inference on these processors performs poorly, and as GNN models develop and graph datasets grow, GNN inference performance faces increasingly serious challenges on x86-64 processors. In this paper, we study the poor performance of DGL-based GAT models on the x86-64 platform and analyze the main bottlenecks in this setting. To optimize DGL's performance on the two main x86-64 CPU families, Intel and AMD, we implement a simple and effective task allocator that balances the load among multiple cores, and we use vector instructions to optimize DGL's core operators. In addition, we propose corresponding optimizations for the NUMA architecture. Experimental results show that our method improves the performance of the baseline DGL version by up to 3.12x on the Intel platform and 2.6x on the AMD platform.
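The abstract does not detail the task allocator, but a common way to balance graph-operator work across cores is to split the node range so that each worker receives a roughly equal number of edges rather than an equal number of nodes (skewed degree distributions make per-node splits unbalanced). The sketch below is a minimal illustration of that idea over a CSR `indptr` array; the function name and interface are hypothetical, not the paper's actual implementation.

```python
def balance_by_edges(indptr, num_workers):
    """Split the row range [0, n) into num_workers contiguous chunks
    whose edge counts are as equal as a contiguous split allows.

    indptr: CSR row-pointer array of length n + 1; indptr[i+1] - indptr[i]
    is the degree (edge count) of node i.
    Returns a list of (start_row, end_row) half-open intervals, one per worker.
    """
    n = len(indptr) - 1
    total = indptr[-1]
    target = total / num_workers
    bounds = [0]
    for w in range(1, num_workers):
        goal = w * target
        # Binary search for the first row whose cumulative edge count
        # reaches this worker boundary's share of the total.
        lo, hi = bounds[-1], n
        while lo < hi:
            mid = (lo + hi) // 2
            if indptr[mid] < goal:
                lo = mid + 1
            else:
                hi = mid
        bounds.append(lo)
    bounds.append(n)
    return [(bounds[i], bounds[i + 1]) for i in range(num_workers)]
```

Each worker then processes only its `(start_row, end_row)` slice of the aggregation, so cores finish at roughly the same time even on power-law graphs where a naive equal-node split would leave one core with most of the edges.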