Making User-Level VMM for Deterministic Parallelism Nonblocking and Efficient
Yu Zhang, Jiange Zhang, Qiliang Zhang
2016 17th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2016-12-01
DOI: 10.1109/PDCAT.2016.042
Many parallel programs are intended to yield deterministic results, but unpredictable thread or process interleavings can lead to subtle bugs and nondeterminism. We previously proposed a producer-consumer virtual memory model, SPMC, for efficient system-enforced deterministic parallelism, and prototyped the SPMC model and its software stack, called DLinux, entirely in Linux user space. This paper summarizes the implementation policies and limitations of our previous DLinux. To reduce the SPMC page fault overhead and the suspend/resume overhead, both of which severely degrade the performance of DLinux, we enhance the SPMC model with a nonblocking test primitive and direct read and write primitives. Based on the extended SPMC model, we improve the implementation of the upper programming abstractions. Experimental results show that, relative to the previous version, the new DLinux improves the performance of NPB workloads by up to 2.33X and 1.76X on 8 and 16 processes, respectively. For CG on 8 processes, its runtime relative to MPICH2 decreases from 4.12X to 1.77X.
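
As a quick consistency check on the two CG figures, the improvement of the new DLinux over the previous version can be derived from the runtimes relative to MPICH2. This is a sketch that assumes both ratios were measured against the same MPICH2 baseline runtime; the symbols $T_{\text{old}}$, $T_{\text{new}}$ and $T_{\text{MPICH2}}$ are introduced here for illustration and do not appear in the paper.

```latex
\[
\frac{T_{\text{old}}}{T_{\text{new}}}
  = \frac{T_{\text{old}} / T_{\text{MPICH2}}}{T_{\text{new}} / T_{\text{MPICH2}}}
  = \frac{4.12}{1.77}
  \approx 2.33
\]
```

Under that assumption, the result coincides with the reported upper bound of 2.33X on 8 processes. This suggests, although the abstract does not state it, that CG is the workload where the peak improvement occurs.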