A Roofline-Based Performance Estimator for Distributed Matrix-Multiply on Intel CnC
Martin Kong, L. Pouchet, P. Sadayappan
2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015-05-25
DOI: 10.1109/IPDPSW.2015.134 (https://doi.org/10.1109/IPDPSW.2015.134)
In this paper we show how to analytically model two widely used distributed matrix-multiply algorithms, Cannon's 2D and Johnson's 3D, implemented within the Intel Concurrent Collections (CnC) framework for shared/distributed memory execution. Our analytical model estimates the computation and communication times, taking into account factors such as the block size, the communication bandwidth, and the processor's peak performance. It then applies a roofline-based approach to determine the running time from whichever of communication or computation is estimated to be the bottleneck. We validate the models by comparing their estimates against measured run times across varying problem sizes and work distributions, and observe only marginal differences. We conclude by using our model to predict the impact of improving the computation speed by a factor of 4×.
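The roofline idea in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it is a simplified, per-process estimate for Cannon's 2D algorithm that assumes a square process grid, perfect overlap of communication with computation (so the run time is the larger of the two), no latency term, and no cost for the initial block alignment. The function name `cannon_2d_estimate` and the parameters `peak_flops`, `bandwidth`, and `word_bytes` are hypothetical names chosen for this example.

```python
import math


def cannon_2d_estimate(n, p, peak_flops, bandwidth, word_bytes=8):
    """Roofline-style time estimate for Cannon's 2D matrix multiply.

    Simplifying assumptions (not taken from the paper):
      - p processes form a q x q grid with q = sqrt(p), and q divides n;
      - each of the q steps multiplies one b x b block pair (2*b^3 flops)
        and shifts one block of A and one block of B to a neighbor;
      - communication fully overlaps computation, latency is ignored.

    n          -- matrix dimension (matrices are n x n)
    p          -- number of processes (must be a perfect square)
    peak_flops -- sustained flop rate of one process, in flop/s
    bandwidth  -- link bandwidth seen by one process, in bytes/s
    word_bytes -- size of one matrix element (8 for double precision)

    Returns (t_compute, t_comm, t_estimate) in seconds.
    """
    q = math.isqrt(p)
    if q * q != p:
        raise ValueError("p must be a perfect square")
    if n % q != 0:
        raise ValueError("grid edge sqrt(p) must divide n")

    b = n // q                      # block edge length owned by each process
    steps = q                       # Cannon's algorithm runs q multiply/shift steps

    flops_per_step = 2 * b ** 3     # one multiply and one add per inner iteration
    bytes_per_step = 2 * b * b * word_bytes  # one A block plus one B block

    t_compute = steps * (flops_per_step / peak_flops)
    t_comm = steps * (bytes_per_step / bandwidth)

    # Roofline step: the slower of the two resources bounds the run time.
    return t_compute, t_comm, max(t_compute, t_comm)
```

Under these assumptions, a hypothetical configuration with n = 1024, p = 16, peak_flops = 134217728.0 and bandwidth = 8388608.0 gives a compute time of 1.0 s and a communication time of 0.5 s, so the estimate is compute-bound at 1.0 s. Quadrupling `peak_flops` drops the compute time to 0.25 s, and the estimate becomes communication-bound at 0.5 s: in this toy setting a 4× faster processor yields only a 2× shorter run. The numbers are illustrative only and are not results from the paper.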