Termination detection for fine-grained message-passing architectures

2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP). Pub Date: 2020-07-01. DOI: 10.1109/ASAP49362.2020.00012

Matthew Naylor, S. Moore, A. Mokhov, David B. Thomas, J. Beaumont, Shane T. Fleming, A. T. Markettos, Thomas Bytheway, Andrew D. Brown


Citations: 6

Abstract

Barrier primitives provided by standard parallel programming APIs are the primary means by which applications implement global synchronisation. Typically these primitives are fully-committed to synchronisation in the sense that, once a barrier is entered, synchronisation is the only way out. For message-passing applications, this raises the question of what happens when a message arrives at a thread that already resides in a barrier. Without a satisfactory answer, barriers do not interact with message-passing in any useful way.

In this paper, we propose a new refutable barrier primitive that combines with message-passing to form a simple, expressive, efficient, well-defined API. It has a clear semantics based on termination detection, and supports the development of both globally-synchronous and asynchronous parallel applications.

To evaluate the new primitive, we implement it in a prototype large-scale message-passing machine with 49,152 RISC-V threads distributed over 48 FPGAs. We show that hardware support for the primitive leads to a highly-efficient implementation, capable of synchronisation rates that are an order-of-magnitude higher than what is achievable in software. Using the primitive, we implement synchronous and asynchronous versions of a range of applications, observing that each version can have significant advantages over the other, depending on the application. Therefore, a barrier primitive supporting both styles can greatly assist the development of parallel programs.
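The idea of a barrier that an incoming message can pull a thread back out of may be easier to see in a toy model. The sketch below is not the paper's API or its hardware implementation; it is a single-process Python simulation under simple assumptions, and every name in it (`RefutableBarrierSim`, `enter_barrier`, `terminated`, `run_countdown`) is hypothetical. A thread votes to synchronise only when its inbox is empty, a message arriving at a waiting thread withdraws that vote, and the barrier completes only when all threads are waiting and no messages remain undelivered, which is the termination-detection condition.

```python
from collections import deque


class RefutableBarrierSim:
    """Toy, sequential model of a refutable barrier (hypothetical names)."""

    def __init__(self, num_threads):
        self.inbox = [deque() for _ in range(num_threads)]
        self.in_barrier = [False] * num_threads
        self.in_flight = 0  # messages sent but not yet consumed

    def send(self, dest, msg):
        """Deliver a message; this refutes the destination's barrier vote."""
        self.inbox[dest].append(msg)
        self.in_flight += 1
        self.in_barrier[dest] = False

    def receive(self, tid):
        """Consume the oldest pending message for thread `tid`."""
        msg = self.inbox[tid].popleft()
        self.in_flight -= 1
        return msg

    def enter_barrier(self, tid):
        """Vote to synchronise; refused while messages are still pending."""
        if self.inbox[tid]:
            return False
        self.in_barrier[tid] = True
        return True

    def terminated(self):
        """True once every thread is waiting and nothing is in flight."""
        return all(self.in_barrier) and self.in_flight == 0


def run_countdown(num_threads=4, start=5):
    """Pass a decreasing counter around a ring until it reaches zero.

    Returns the number of messages handled before termination is detected.
    """
    sim = RefutableBarrierSim(num_threads)
    sim.send(0, start)
    handled = 0
    while not sim.terminated():
        for tid in range(num_threads):
            if sim.inbox[tid]:
                n = sim.receive(tid)
                handled += 1
                if n > 0:
                    sim.send((tid + 1) % num_threads, n - 1)
            else:
                sim.enter_barrier(tid)
    return handled


if __name__ == "__main__":
    print(run_countdown())
```

In this model the same loop serves both styles mentioned in the abstract: an asynchronous application simply keeps exchanging messages until `terminated()` holds, while a synchronous one could treat each detected termination as the end of a time step and start the next.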


