{"title":"Application Level Reordering of Remote Direct Memory Access Operations","authors":"W. Lavrijsen, Costin Iancu","doi":"10.1109/IPDPS.2017.98","DOIUrl":null,"url":null,"abstract":"We present methods for the effective application level reordering of non-blocking RDMA operations. We supplement out-of-order hardware delivery mechanisms with heuristics to account for the CPU side overhead of communication and for differences in network latency: a runtime scheduler takes into account message sizes, destination and concurrency and reorders operations to improve overall communication throughput. Results are validated on InfiniBand and Cray Aries networks, for SPMD and hybrid (SPMD+OpenMP) programming models. We show up to 5! potential speedup, with 30-50% more typical, for synthetic message patterns in microbenchmarks. We also obtain up to 33% improvement in the communication stages in application settings. While the design space is complex, the resulting scheduler is simple, both internally and at the application level interfaces. It also provides performance portability across networks and programming models. We believe these techniques can be easily retrofitted within any application or runtime framework that uses one-sided communication, e.g. using GASNet, MPI 3.0 RMA or low level APIs such as IBVerbs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2017.98","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
We present methods for the effective application level reordering of non-blocking RDMA operations. We supplement out-of-order hardware delivery mechanisms with heuristics to account for the CPU side overhead of communication and for differences in network latency: a runtime scheduler takes into account message sizes, destination and concurrency and reorders operations to improve overall communication throughput. Results are validated on InfiniBand and Cray Aries networks, for SPMD and hybrid (SPMD+OpenMP) programming models. We show up to 5! potential speedup, with 30-50% more typical, for synthetic message patterns in microbenchmarks. We also obtain up to 33% improvement in the communication stages in application settings. While the design space is complex, the resulting scheduler is simple, both internally and at the application level interfaces. It also provides performance portability across networks and programming models. We believe these techniques can be easily retrofitted within any application or runtime framework that uses one-sided communication, e.g. using GASNet, MPI 3.0 RMA or low level APIs such as IBVerbs.