{"title":"改进通用并行代码的运行时调度","authors":"Alexandros Tzannes, R. Barua, U. Vishkin","doi":"10.1109/PACT.2011.49","DOIUrl":null,"url":null,"abstract":"Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High level language extensions and libraries (e.g., Open MP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient {\\em work-stealing} scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g, data-parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes over fitting of the code to the input data, platform and context used to do the coarsening, and harms performance-portability. We propose distinguishing between two types of coarsening and using different techniques for them. Then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically, but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS) overcomes the scalability issue of LBS and performs much better on large machines.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Improving Run-Time Scheduling for General-Purpose Parallel Code\",\"authors\":\"Alexandros Tzannes, R. Barua, U. Vishkin\",\"doi\":\"10.1109/PACT.2011.49\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High level language extensions and libraries (e.g., Open MP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient {\\\\em work-stealing} scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g, data-parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes over fitting of the code to the input data, platform and context used to do the coarsening, and harms performance-portability. We propose distinguishing between two types of coarsening and using different techniques for them. Then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically, but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS) overcomes the scalability issue of LBS and performs much better on large machines.\",\"PeriodicalId\":106423,\"journal\":{\"name\":\"2011 International Conference on Parallel Architectures and Compilation Techniques\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 International Conference on Parallel Architectures and Compilation Techniques\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PACT.2011.49\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference on Parallel Architectures and Compilation Techniques","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2011.49","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Improving Run-Time Scheduling for General-Purpose Parallel Code
Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High level language extensions and libraries (e.g., Open MP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient {\em work-stealing} scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g, data-parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes over fitting of the code to the input data, platform and context used to do the coarsening, and harms performance-portability. We propose distinguishing between two types of coarsening and using different techniques for them. Then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically, but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS) overcomes the scalability issue of LBS and performs much better on large machines.