RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn
{"title":"BLIS中GEMM的统一实现","authors":"RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn","doi":"10.1145/3577193.3593707","DOIUrl":null,"url":null,"abstract":"Matrix libraries often focus on achieving high performance for problems considered to be either \"small\" or \"large\", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing - an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance - with the first computational \"pass\" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler since it obviates the need to carefully express and parameterize logic that chooses between a \"small matrix\" strategy and a \"large matrix\" strategy. A prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework is described and performance on a range of architectures is reported.","PeriodicalId":424155,"journal":{"name":"Proceedings of the 37th International Conference on Supercomputing","volume":"140 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards a Unified Implementation of GEMM in BLIS\",\"authors\":\"RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn\",\"doi\":\"10.1145/3577193.3593707\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Matrix libraries often focus on achieving high performance for problems considered to be either \\\"small\\\" or \\\"large\\\", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing - an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance - with the first computational \\\"pass\\\" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler since it obviates the need to carefully express and parameterize logic that chooses between a \\\"small matrix\\\" strategy and a \\\"large matrix\\\" strategy. A prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework is described and performance on a range of architectures is reported.\",\"PeriodicalId\":424155,\"journal\":{\"name\":\"Proceedings of the 37th International Conference on Supercomputing\",\"volume\":\"140 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-06-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 37th International Conference on Supercomputing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3577193.3593707\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 37th International Conference on Supercomputing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577193.3593707","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (gemm) that can achieve high performance for both small and large problem sizes. The key is to fuse packing - an operation that copies data to a contiguous layout in memory and which is critical for large matrix performance - with the first computational "pass" over that data. This boosts performance across the problem size spectrum. As a result, tuning general-purpose libraries becomes simpler since it obviates the need to carefully express and parameterize logic that chooses between a "small matrix" strategy and a "large matrix" strategy. A prototype implementation of the technique built with the BLAS-like Library Instantiation Software (BLIS) framework is described and performance on a range of architectures is reported.