图形工作负载的fpga加速事务性执行

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Pub Date : 2017-02-22 DOI:10.1145/3020078.3021743

Xiaoyu Ma, Dan Zhang, Derek Chiou

{"title":"图形工作负载的fpga加速事务性执行","authors":"Xiaoyu Ma, Dan Zhang, Derek Chiou","doi":"10.1145/3020078.3021743","DOIUrl":null,"url":null,"abstract":"Many applications that operate on large graphs can be intuitively parallelized by executing a large number of the graph operations concurrently and as transactions to deal with potential conflicts. However, large numbers of operations occurring concurrently might incur too many conflicts that would negate the potential benefits of the parallelization which has probably made highly multi-threaded transactional machines seem impractical. Given the large size and topology of many modern graphs, however, such machines can provide real performance, energy efficiency, and programability benefits. This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler. We present challenges of realizing such an architecture, especially the requirement of scalable conflict detection, and propose solutions. We also argue that despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved. We implement the proposed architecture as a synthesizable FPGA RTL design and demonstrate improved per-socket performance (2X) and energy efficiency (22X) by comparing to a baseline platform that contains two Intel Haswell processors, each with 12 cores.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"27","resultStr":"{\"title\":\"FPGA-Accelerated Transactional Execution of Graph Workloads\",\"authors\":\"Xiaoyu Ma, Dan Zhang, Derek Chiou\",\"doi\":\"10.1145/3020078.3021743\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many applications that operate on large graphs can be intuitively parallelized by executing a large number of the graph operations concurrently and as transactions to deal with potential conflicts. However, large numbers of operations occurring concurrently might incur too many conflicts that would negate the potential benefits of the parallelization which has probably made highly multi-threaded transactional machines seem impractical. Given the large size and topology of many modern graphs, however, such machines can provide real performance, energy efficiency, and programability benefits. This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler. We present challenges of realizing such an architecture, especially the requirement of scalable conflict detection, and propose solutions. We also argue that despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved. We implement the proposed architecture as a synthesizable FPGA RTL design and demonstrate improved per-socket performance (2X) and energy efficiency (22X) by comparing to a baseline platform that contains two Intel Haswell processors, each with 12 cores.\",\"PeriodicalId\":252039,\"journal\":{\"name\":\"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-02-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"27\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3020078.3021743\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3020078.3021743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 27

摘要

许多操作大型图的应用程序可以通过并发地执行大量图操作并将其作为事务处理潜在冲突来直观地并行化。然而，并发发生的大量操作可能会导致太多冲突，从而抵消并行化的潜在好处，这可能会使高度多线程的事务机器看起来不切实际。然而，考虑到许多现代图的大尺寸和拓扑结构，这样的机器可以提供真正的性能、能源效率和可编程性优势。本文描述了一个由许多轻量级多线程处理引擎、全局事务性共享内存和工作调度器组成的体系结构。我们提出了实现这种架构的挑战，特别是对可扩展冲突检测的需求，并提出了解决方案。我们还认为，尽管由于更高的并发性和单线程延迟而增加了事务冲突，但可以实现串行执行的可扩展加速。我们将提出的架构实现为可合成的FPGA RTL设计，并通过与包含两个Intel Haswell处理器的基线平台(每个处理器具有12核)进行比较，展示了改进的每个插槽性能(2X)和能效(22X)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

FPGA-Accelerated Transactional Execution of Graph Workloads

Many applications that operate on large graphs can be intuitively parallelized by executing a large number of the graph operations concurrently and as transactions to deal with potential conflicts. However, large numbers of operations occurring concurrently might incur too many conflicts that would negate the potential benefits of the parallelization which has probably made highly multi-threaded transactional machines seem impractical. Given the large size and topology of many modern graphs, however, such machines can provide real performance, energy efficiency, and programability benefits. This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler. We present challenges of realizing such an architecture, especially the requirement of scalable conflict detection, and propose solutions. We also argue that despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved. We implement the proposed architecture as a synthesizable FPGA RTL design and demonstrate improved per-socket performance (2X) and energy efficiency (22X) by comparing to a baseline platform that contains two Intel Haswell processors, each with 12 cores.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

自引率

0.00%

发文量