Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Publications

Performance Comparison of Multiple Approaches of Status Register for Medium Density Memory Suitable for Implementation of a Lossless Compression Dictionary: (Abstract Only)
Matěj Bartík, S. Ubik, P. Kubalík, Tomás Benes
This paper presents a performance comparison of various approaches to realizing a status register suitable for maintaining (in)valid bits in mid-density memory structures implemented in Xilinx FPGAs. An example of such a structure with a status register is a dictionary for Lempel-Ziv based lossless compression algorithms, where the dictionary has to be initialized before each run of the algorithm with minimal time and logic resource consumption. The designs were evaluated with the Xilinx ISE and Vivado toolkits targeting the Virtex-7 FPGA. This research has been partially supported by the CTU project SGS17/017/OHK3/1T/18 "Dependable and attack-resistant architectures for programmable devices" and by the project "E-infrastructure CESNET modernization", no. CZ.02.1.01/0.0/0.0/16 013/0001797.
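The abstract does not enumerate the compared register designs; as a software analogue of one classic approach to constant-time bulk invalidation, here is a minimal Python sketch using a generation (epoch) counter. All names are illustrative, not taken from the paper.

```python
# Hypothetical sketch: bulk-invalidating a dictionary in O(1) with an epoch
# counter, instead of spending one cycle per entry to clear valid bits.
class EpochDictionary:
    def __init__(self, size):
        self.epoch = 0                 # incremented once per compression run
        self.tags = [0] * size         # epoch in which each entry was written
        self.data = [None] * size

    def reset(self):
        """Invalidate every entry in constant time."""
        self.epoch += 1                # assumes the counter is wide enough not to wrap

    def write(self, index, value):
        self.data[index] = value
        self.tags[index] = self.epoch  # entry becomes valid for this run

    def read(self, index):
        """Return the entry, or None if it predates the current run."""
        return self.data[index] if self.tags[index] == self.epoch else None
```

In hardware the trade-off is an extra tag comparison per access in exchange for skipping the per-entry clearing pass between runs.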
{"title":"Performance Comparison of Multiple Approaches of Status Register for Medium Density Memory Suitable for Implementation of a Lossless Compression Dictionary: (Abstract Only)","authors":"Matěj Bartík, S. Ubik, P. Kubalík, Tomás Benes","doi":"10.1145/3174243.3174976","DOIUrl":"https://doi.org/10.1145/3174243.3174976","url":null,"abstract":"This paper presents a performance comparison of various approaches of realization of status register suitable for maintaining (in)valid bits in mid-density memory structures implemented in Xilinx FPGAs. An example of a such structure with status register could be a dictionary for Lempel-Ziv based lossless compression algorithms where the dictionary has to be initialized before each run of the algorithm with minimum time and logic resources consumption. The performance evaluation of designs has been made in Xilinx ISE and Vivado toolkits for the Virtex-7 FPGA. This research has been partially supported by the CTU project SGS17/017/OHK3/1T/18 \"Dependable and attack-resistant architectures for programmable devices\" and by the project \"E-infrastructure CESNET \"modernization\" no. CZ.02.1.01/0.0/0.0/16 013/0001797.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131548720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Performance Comparison of Multiples and Target Detection with Imager-driven Processing Mode for Ultrafast-Imager: (Abstract Only)
Xiaoyu Yu, D. Ye
Recent vision tasks increasingly demand real-time processing with high frame rates and low latency. High spatiotemporal resolution imagers continue to appear, but few can be used in real applications owing to the excessive computational burden and the lack of suitable architectures. This paper presents a solution for the target detection task in imager-driven processing mode (IMP), whose processing time is shorter than the gap between frames even when the ultrafast imager runs at full frame rate. The high-throughput pixel stream output by the imager is analyzed based on multiple features in a fully pipelined, bufferless FPGA architecture. A pyramid-shaped model consisting of a 2-D processing element (PE) array is proposed to search the connected regions of target candidates distributed across different time slices and to extract the corresponding features as the stream passes through. A label-based 1-D PE array collects the feature flow generated by the pyramid according to its labels and outputs the feature vector of each target candidate in real time. The proposed model has been tested in simulation and experiments on a 0.8 Gpixel/s (2320×1726 at 192 FPS) input stream, with latency below 1 microsecond.
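The pyramid PE array is hardware-specific, but the building block it streams over, connected-region search over pixel runs, can be sketched in software. The following Python sketch assumes a binary input and 4-connectivity; it is an illustration of the general technique, not the paper's architecture.

```python
# Streaming run-based connected-component labeling with union-find.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def make(self, x):
        self.parent.setdefault(x, x)

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:          # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def label_rows(rows):
    """rows: iterable of 0/1 pixel lists, one per scanline. Yields per-row
    runs as (start, end, label); labels may still merge in later rows."""
    uf, next_label, prev = UnionFind(), 0, []
    for row in rows:
        runs, x = [], 0
        while x < len(row):
            if not row[x]:
                x += 1
                continue
            start = x
            while x < len(row) and row[x]:     # extend the run of 1-pixels
                x += 1
            label = None
            for ps, pe, pl in prev:            # overlapping run above? (4-conn)
                if ps < x and pe > start:
                    if label is None:
                        label = pl
                    else:
                        uf.union(label, pl)    # two regions meet: merge them
            if label is None:                  # no neighbor above: new region
                label, next_label = next_label, next_label + 1
                uf.make(label)
            runs.append((start, x, label))
        prev = runs
        yield [(s, e, uf.find(l)) for s, e, l in runs]
```

Per-run feature accumulators keyed by label (area, bounding box, and so on) would then play the role of the abstract's label-based feature collection.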
{"title":"Performance Comparison of Multiples and Target Detection with Imager-driven Processing Mode for Ultrafast-Imager: (Abstract Only)","authors":"Xiaoyu Yu, D. Ye","doi":"10.1145/3174243.3174990","DOIUrl":"https://doi.org/10.1145/3174243.3174990","url":null,"abstract":"Latest vision tasks trend to be the real-time processing with high throughput frame rate and low latency. High spatiotemporal resolution imagers continue to spring up but only a few of them can be used in real applications owing to the excessive computational burden and lacking of suitable architecture. This paper presents a solution for target detection task in imager-driven processing mode (IMP), which takes shorter time in processing than the time gap between frames, even if the ulreafast imager run at full frame rate. High throughput pixel stream outputted from imager is analyzed base on multi features in a fully pipelined and bufferless architecture in FPGA. A pyramid shape model consisting of 2-D Processing Element (PE) array is proposed to search the connected regions of target candidates distributed at different time slices, and extract corresponding features when the stream pass through. A Label based 1-D PE Array collects the feature flow generated by the pyramid according to their labels, and output the feature vector of each target candidate in real time. The proposed model has been tested in simulation and experiments for target detection with 0.8Gpixel/sec (2320×1726 with 192FPS) data stream input, and the latency is less than 1 microsecond.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132836845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Understanding Performance Differences of FPGAs and GPUs: (Abstract Only)
J. Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, Shaochong Zhang
The notorious power wall has significantly limited the scaling of general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, have emerged to achieve better performance and energy efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited for FPGAs, which for GPUs, and why? In this paper, our goal is to better understand the performance differences between FPGAs and GPUs and provide more insights to the community. We intentionally start with the widely used GPU-friendly benchmark suite Rodinia and port 11 of the benchmarks (15 kernels) onto FPGAs using the more portable and programmable high-level synthesis (HLS) C. We provide a simple five-step strategy for FPGA accelerator designs that can be easily understood and mastered by software programmers, and present a quantitative performance breakdown of each step. We then propose a set of performance metrics, including normalized operations per cycle (OPC_norm) for each pipeline and effective parallel factor (effective_para_factor), to compare GPU and FPGA accelerator designs. We find that for 6 out of the 15 kernels, today's FPGAs can provide comparable or even better performance while consuming only about 1/10 of the GPUs' power (both on the same technology node). We observe that FPGAs usually have higher OPC_norm in most kernels thanks to their customized deep pipelines, but lower effective_para_factor due to far lower memory bandwidth than GPUs. Future FPGAs should increase their off-chip bandwidth and clock frequency to catch up with GPUs.
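The abstract names the two metrics without giving formulas; the sketch below encodes one plausible reading of them, purely as an assumption for illustration.

```python
# Assumed definitions, not taken from the paper.
def opc_norm(total_ops, total_cycles, num_pipelines):
    """Assumed: average operations completed per cycle by one pipeline."""
    return total_ops / (total_cycles * num_pipelines)

def effective_para_factor(achieved_throughput, single_pipeline_throughput):
    """Assumed: how many pipelines' worth of throughput is actually sustained."""
    return achieved_throughput / single_pipeline_throughput
```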
{"title":"Understanding Performance Differences of FPGAs and GPUs: (Abtract Only)","authors":"J. Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, Shaochong Zhang","doi":"10.1145/3174243.3174970","DOIUrl":"https://doi.org/10.1145/3174243.3174970","url":null,"abstract":"The notorious power wall has significantly limited the scaling for general-purpose processors. To address this issue, various accelerators, such as GPUs and FPGAs, emerged to achieve better performance and energy-efficiency. Between these two programmable accelerators, a natural question arises: which applications are better suited for FPGAs, which for GPUs, and why? In this paper, our goal is to better understand the performance differences between FPGAs and GPUs and provide more insights to the community. We intentionally start with a widely used GPU-friendly benchmark suite Rodinia, and port 11 of the benchmarks (15 kernels) onto FPGAs using the more portable and programmable high-level synthesis C. We provide a simple five-step strategy for FPGA accelerator designs that can be easily understood and mastered by software programmers, and present a quantitative performance breakdown of each step. Then we propose a set of performance metrics, including normalized operations per cycle (OPC_norm) for each pipeline, and effective parallel factor (effective_para_factor), to compare the performance of GPU and FPGA accelerator designs. We find that for 6 out of the 15 kernels, today's FPGAs can provide comparable performance or even achieve better performance, while only consume about 1/10 of GPUs' power (both on the same technology node). We observe that FPGAs usually have higher OPC_norm in most kernels in light of their customized deep pipeline but lower effective_para_factor due to far lower memory bandwidth than GPUs. Future FPGAs should increase their off-chip bandwidth and clock frequency to catch up with GPUs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131663016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Session details: Special Session: Deep Learning
A. Ling
{"title":"Session details: Special Session: Deep Learning","authors":"A. Ling","doi":"10.1145/3252935","DOIUrl":"https://doi.org/10.1145/3252935","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114840298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator
Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, T. Delbrück
Recurrent Neural Networks (RNNs) are widely used in speech recognition and natural language processing applications because of their capability to process temporal sequences. Because RNNs are fully connected, they require a large number of weight memory accesses, leading to high power consumption. Recent theory has shown that a delta network update approach can reduce memory accesses and computation with negligible accuracy loss. This paper describes the implementation of this theoretical approach in a hardware accelerator called "DeltaRNN" (DRNN). The DRNN updates the output of a neuron only when the neuron's activation changes by more than a delta threshold. It was implemented on a Xilinx Zynq-7100 FPGA. FPGA measurement results from a single-layer RNN of 256 Gated Recurrent Unit (GRU) neurons show that the DRNN achieves 1.2 TOp/s effective throughput and 164 GOp/s/W power efficiency. The delta update leads to a 5.7x speedup compared to a conventional RNN update because of the sparsity created by the delta network algorithm and the zero-skipping ability of the DRNN.
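The delta-network update that DRNN implements in hardware can be sketched in software as follows; the threshold value and names are illustrative, not taken from the paper.

```python
import numpy as np

THETA = 0.1  # delta threshold; illustrative value only

# Only input components whose change since the last *propagated* value
# exceeds THETA contribute, so most weight columns need not be fetched.
def delta_matvec(W, x, state):
    """W: (n, m) weights; x: new input (m,); state: (x_prev, M_prev).
    Returns (approximation of W @ x, updated state)."""
    x_prev, M_prev = state
    delta = x - x_prev
    mask = np.abs(delta) > THETA              # components that changed enough
    M = M_prev + W[:, mask] @ delta[mask]     # sparse update of cached result
    x_new = np.where(mask, x, x_prev)         # remember what was propagated
    return M, (x_new, M)

# Usage: state = (np.zeros(m), np.zeros(n)); call once per timestep.
```

The boolean mask is where the hardware savings come from: skipped components cost neither weight fetches nor multiply-accumulates.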
{"title":"DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator","authors":"Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, T. Delbrück","doi":"10.1145/3174243.3174261","DOIUrl":"https://doi.org/10.1145/3174243.3174261","url":null,"abstract":"Recurrent Neural Networks (RNNs) are widely used in speech recognition and natural language processing applications because of their capability to process temporal sequences. Because RNNs are fully connected, they require a large number of weight memory accesses, leading to high power consumption. Recent theory has shown that an RNN delta network update approach can reduce memory access and computes with negligible accuracy loss. This paper describes the implementation of this theoretical approach in a hardware accelerator called \"DeltaRNN\" (DRNN). The DRNN updates the output of a neuron only when the neuron»s activation changes by more than a delta threshold. It was implemented on a Xilinx Zynq-7100 FPGA. FPGA measurement results from a single-layer RNN of 256 Gated Recurrent Unit (GRU) neurons show that the DRNN achieves 1.2 TOp/s effective throughput and 164 GOp/s/W power efficiency. The delta update leads to a 5.7x speedup compared to a conventional RNN update because of the sparsity created by the DN algorithm and the zero-skipping ability of DRNN.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"147 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126913804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 108
Improving FPGA Performance with a S44 LUT Structure
Wenyi Feng, J. Greene, A. Mishchenko
FPGA performance depends in part on the choice of basic logic cell. Previous work dating back to 1999-2005 found that the best look-up table (LUT) sizes for area-delay product are 4-6, with 4 better for area and 6 for performance. Since that time several things have changed. A new 'LUT structure' mapping technique can target cells with a larger number of inputs (cut size) without assuming that the cell implements all possible functions of those inputs. We consider in particular a 7-input function composed of two tightly-coupled 4-input LUTs. Changes in process technology have increased the relative importance of wiring delay and configuration memory area. Finally, modern benchmark applications include carry chains, math and memory blocks. Due to these changes, we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs.
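A minimal software model of the S44 cell described above: two cascaded 4-input LUTs covering a 7-input cut. The truth-table encoding and names are illustrative.

```python
# Truth tables are 16-bit integers; bit i is the output for input pattern i.
def lut4(tt, a, b, c, d):
    index = (d << 3) | (c << 2) | (b << 1) | a
    return (tt >> index) & 1

def s44(tt_first, tt_second, x):
    """x: seven input bits x[0..6]. x[0..3] feed the first LUT, whose output
    cascades into the second LUT together with x[4..6]."""
    y = lut4(tt_first, x[0], x[1], x[2], x[3])
    return lut4(tt_second, y, x[4], x[5], x[6])

# Example: a 7-input AND. 0x8000 is the 4-input AND truth table (only bit 15
# set), so AND(x0..x3) cascades into AND(y, x4, x5, x6).
assert s44(0x8000, 0x8000, [1] * 7) == 1
assert s44(0x8000, 0x8000, [1, 1, 1, 0, 1, 1, 1]) == 0
```

Note the restriction the mapper exploits: the cell covers a 7-input cut without implementing all 2^128 possible 7-input functions, which keeps the configuration memory small.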
{"title":"Improving FPGA Performance with a S44 LUT Structure","authors":"Wenyi Feng, J. Greene, A. Mishchenko","doi":"10.1145/3174243.3174272","DOIUrl":"https://doi.org/10.1145/3174243.3174272","url":null,"abstract":"FPGA performance depends in part on the choice of basic logic cell. Previous work dating back to 1999-2005 found that the best look-up table (LUT) sizes for area-delay product are 4-6, with 4 better for area and 6 for performance. Since that time several things have changed. A new 'LUT structure' mapping technique can target cells with a larger number of inputs (cut size) without assuming that the cell implements all possible functions of those inputs. We consider in particular a 7-input function composed of two tightly-coupled 4-input LUTs. Changes in process technology have increased the relative importance of wiring delay and configuration memory area. Finally, modern benchmark applications include carry chains, math and memory blocks. Due to these changes, we show that mapping to a 7-input LUT structure can approach the performance of 6-input LUTs while retaining the area and static power advantage of 4-input LUTs.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127930833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Session details: Session 7: Circuits and Computation Engines
Nachiket Kapre
{"title":"Session details: Session 7: Circuits and Computation Engines","authors":"Nachiket Kapre","doi":"10.1145/3252942","DOIUrl":"https://doi.org/10.1145/3252942","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133209284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dynamically Scheduled High-level Synthesis
Lana Josipović, Radhika Ghosal, P. Ienne
High-level synthesis (HLS) tools almost universally generate statically scheduled datapaths. Static scheduling implies that circuits out of HLS tools have a hard time exploiting parallelism in code with potential memory dependencies, with control-dependent dependencies in inner loops, or where performance is limited by long latency control decisions. The situation is essentially the same as in computer architecture between Very-Long Instruction Word (VLIW) processors and dynamically scheduled superscalar processors; the former display the best performance per cost in highly regular embedded applications, but general purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. In this work, we show that high-level synthesis of dynamically scheduled circuits is perfectly feasible by describing the implementation of a prototype synthesizer which generates a particular form of latency-insensitive synchronous circuits. Compared to a commercial HLS tool, the result is a different trade-off between performance and circuit complexity, much as superscalar processors represent a different trade-off compared to VLIW processors: in demanding applications, the performance is very significantly improved at an affordable cost. We here demonstrate only the first steps towards more performant high-level synthesis tools adapted to emerging FPGA applications and the demands of computing in broader application domains.
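The latency-insensitive circuits the synthesizer generates rest on valid/ready handshakes; the sketch below models that handshake in software. It illustrates the general mechanism, not the paper's generator.

```python
from collections import deque

# An elastic pipeline stage: it fires only when its input is valid and its
# buffer has room, so stalls propagate as runtime backpressure rather than
# being resolved by a compile-time schedule.
class ElasticStage:
    def __init__(self, fn, depth=1):
        self.fn, self.buf, self.depth = fn, deque(), depth

    def ready(self):                  # may a producer push this cycle?
        return len(self.buf) < self.depth

    def push(self, token):            # producer side of the handshake
        assert self.ready()
        self.buf.append(self.fn(token))

    def pop(self):                    # consumer side: None means "not valid"
        return self.buf.popleft() if self.buf else None

# Usage: move a token only when both ends agree, e.g.
#   if token is not None and stage.ready(): stage.push(token)
```

A statically scheduled datapath would instead fix, at compile time, the exact cycle each operation fires, which is precisely what breaks down under variable-latency memory accesses.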
{"title":"Dynamically Scheduled High-level Synthesis","authors":"Lana Josipović, Radhika Ghosal, P. Ienne","doi":"10.1145/3174243.3174264","DOIUrl":"https://doi.org/10.1145/3174243.3174264","url":null,"abstract":"High-level synthesis (HLS) tools almost universally generate statically scheduled datapaths. Static scheduling implies that circuits out of HLS tools have a hard time exploiting parallelism in code with potential memory dependencies, with control-dependent dependencies in inner loops, or where performance is limited by long latency control decisions. The situation is essentially the same as in computer architecture between Very-Long Instruction Word (VLIW) processors and dynamically scheduled superscalar processors; the former display the best performance per cost in highly regular embedded applications, but general purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. In this work, we show that high-level synthesis of dynamically scheduled circuits is perfectly feasible by describing the implementation of a prototype synthesizer which generates a particular form of latency-insensitive synchronous circuits. Compared to a commercial HLS tool, the result is a different trade-off between performance and circuit complexity, much as superscalar processors represent a different trade-off compared to VLIW processors: in demanding applications, the performance is very significantly improved at an affordable cost. We here demonstrate only the first steps towards more performant high-level synthesis tools adapted to emerging FPGA applications and the demands of computing in broader application domains.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130115831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 72
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, Yun Liang
Recently, significant accuracy improvements have been achieved for acoustic recognition systems by increasing the model size of Long Short-Term Memory (LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to inefficient designs on FPGAs due to the limited on-chip resources. Previous work proposed a pruning-based compression technique to reduce the model size and thus speed up inference on FPGAs. However, the random nature of pruning transforms the dense matrices of the model into highly unstructured sparse ones, which leads to unbalanced computation and irregular memory accesses and thus hurts overall performance and energy efficiency. In contrast, we propose a structured compression technique which not only reduces the LSTM model size but also eliminates the irregularities of computation and memory accesses. This approach employs block-circulant rather than sparse matrices to compress the weight matrices and reduces the storage requirement from $\mathcal{O}(k^2)$ to $\mathcal{O}(k)$. The Fast Fourier Transform algorithm is utilized to further accelerate inference by reducing the computational complexity from $\mathcal{O}(k^2)$ to $\mathcal{O}(k \log k)$. The datapath and activation functions are quantized to 16 bits to improve resource utilization. More importantly, we propose a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs. According to the experimental results, C-LSTM achieves up to 18.8X and 33.5X gains in performance and energy efficiency, respectively, compared with the state-of-the-art LSTM implementation under the same experimental setup, and the accuracy degradation is very small.
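The block-circulant compression rests on a standard FFT identity: a circulant matrix-vector product is an element-wise product in the frequency domain. A minimal numpy sketch, with the block size and layout assumed for illustration:

```python
import numpy as np

def circulant_matvec(c, x):
    """y = C @ x, where C is the circulant matrix whose first column is c.
    Costs O(k log k) instead of O(k^2)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(blocks, x, k):
    """blocks[i][j]: first column (length k) of block (i, j) of the weight
    matrix W; x is split into length-k chunks. Storage per block drops from
    k*k values to k values, matching the O(k^2) -> O(k) claim above."""
    p, q = len(blocks), len(blocks[0])
    y = np.zeros(p * k)
    for i in range(p):
        for j in range(q):
            y[i*k:(i+1)*k] += circulant_matvec(blocks[i][j], x[j*k:(j+1)*k])
    return y
```

In a real implementation the FFTs of the weight vectors would be precomputed once, leaving only one forward FFT per input chunk and one inverse FFT per output chunk at inference time.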
{"title":"C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs","authors":"Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, Yun Liang","doi":"10.1145/3174243.3174253","DOIUrl":"https://doi.org/10.1145/3174243.3174253","url":null,"abstract":"Recently, significant accuracy improvement has been achieved for acoustic recognition systems by increasing the model size of Long Short-Term Memory (LSTM) networks. Unfortunately, the ever-increasing size of LSTM model leads to inefficient designs on FPGAs due to the limited on-chip resources. The previous work proposes to use a pruning based compression technique to reduce the model size and thus speedups the inference on FPGAs. However, the random nature of the pruning technique transforms the dense matrices of the model to highly unstructured sparse ones, which leads to unbalanced computation and irregular memory accesses and thus hurts the overall performance and energy efficiency. In contrast, we propose to use a structured compression technique which could not only reduce the LSTM model size but also eliminate the irregularities of computation and memory accesses. This approach employs block-circulant instead of sparse matrices to compress weight matrices and reduces the storage requirement from $mathcalO (k^2)$ to $mathcalO (k)$. Fast Fourier Transform algorithm is utilized to further accelerate the inference by reducing the computational complexity from $mathcalO (k^2)$ to $mathcalO (ktextlog k)$. The datapath and activation functions are quantized as 16-bit to improve the resource utilization. More importantly, we propose a comprehensive framework called C-LSTM to automatically optimize and implement a wide range of LSTM variants on FPGAs. According to the experimental results, C-LSTM achieves up to 18.8X and 33.5X gains for performance and energy efficiency compared with the state-of-the-art LSTM implementation under the same experimental setup, and the accuracy degradation is very small.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131944511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 178
ADAM: Automated Design Analysis and Merging for Speeding up FPGA Development
Ho-Cheung Ng, Shuanglong Liu, W. Luk
This paper introduces ADAM, an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs, especially during the development process. ADAM has three key elements. First, a novel approximate maximum common subgraph detection algorithm with linear time complexity that maximizes sharing of resources in the merged design. Second, a prototype tool implementing this common subgraph detection algorithm for dataflow graphs derived from Verilog designs; the tool also generates the appropriate control circuits to enable selection of the original designs at runtime. Third, a comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach. Experimental results show that ADAM can reduce compilation time by around 5 times when each design is 95% similar to the others, and in the case of binomial filters the compilation time drops from 1 hour to 10 minutes.
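The abstract does not spell out the linear-time approximate common-subgraph algorithm; the greedy pairing below is an assumed stand-in that shows the flavor of resource sharing by operation type.

```python
from collections import defaultdict, deque

# Matched node pairs could share one hardware operator in the merged design,
# with runtime multiplexers (not shown) selecting the active design. A real
# algorithm would also account for edge connectivity when pairing nodes.
def greedy_match(graph_a, graph_b):
    """graph_*: lists of (node_id, op_type) in topological order.
    Returns [(id_a, id_b), ...] pairs proposed for merging."""
    pool = defaultdict(deque)                 # unmatched nodes of B per op type
    for node_id, op in graph_b:
        pool[op].append(node_id)
    matches = []
    for node_id, op in graph_a:               # single linear scan over A
        if pool[op]:
            matches.append((node_id, pool[op].popleft()))
    return matches
```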
{"title":"ADAM: Automated Design Analysis and Merging for Speeding up FPGA Development","authors":"Ho-Cheung Ng, Shuanglong Liu, W. Luk","doi":"10.1145/3174243.3174247","DOIUrl":"https://doi.org/10.1145/3174243.3174247","url":null,"abstract":"This paper introduces ADAM, an approach for merging multiple FPGA designs into a single hardware design, so that multiple place-and-route tasks can be replaced by a single task to speed up functional evaluation of designs, especially during the development process. ADAM has three key elements. First, a novel approximate maximum common subgraph detection algorithm with linear time complexity to maximize sharing of resources in the merged design. Second, a prototype tool implementing this common subgraph detection algorithm for dataflow graphs derived from Verilog designs; this tool would also generate the appropriate control circuits to enable selection of the original designs at runtime. Third, a comprehensive analysis of compilation time versus degree of similarity to identify the optimized user parameters for the proposed approach. Experimental results show that ADAM can reduce compilation time by around 5 times when each design is 95% similar to the others, and the compilation time is reduced from 1 hour to 10 minutes in the case of binomial filters.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133840358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9