Hardware Acceleration of Large Scale GCN Inference
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00019
Bingyi Zhang, Hanqing Zeng, V. Prasanna
Graph Convolutional Networks (GCNs) have become state-of-the-art deep learning models for representation learning on graphs. Hardware acceleration of GCN inference is challenging due to: 1) the massive size of the input graph, 2) the heterogeneous workload of GCN inference, which consists of sparse and dense matrix operations, and 3) the irregular information propagation along the edges during the computation. To address these challenges, we propose an algorithm-architecture co-optimization to accelerate large-scale GCN inference on FPGA. We first perform data partitioning so that each partition fits in the limited on-chip memory. Then, we use a two-phase preprocessing algorithm consisting of sparsification and node reordering. The first phase (sparsification) eliminates edge connections of high-degree nodes by merging common neighbor nodes. The second phase (reordering) effectively groups adjacent nodes to improve on-chip data reuse. Incorporating the above algorithmic optimizations, we propose a generic FPGA architecture that pipelines the two major computational kernels in GCN: aggregation and transformation. The flexible data path and task-scheduling strategy of our design support various GCN models and lead to high-throughput inference. We evaluate our design on a state-of-the-art FPGA platform using three large-scale datasets: Flickr, Reddit, and Yelp. Compared with state-of-the-art multi-core and GPU baselines, our design improves throughput by up to $30\times$ and $2\times$, respectively.
{"title":"Hardware Acceleration of Large Scale GCN Inference","authors":"Bingyi Zhang, Hanqing Zeng, V. Prasanna","doi":"10.1109/ASAP49362.2020.00019","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00019","url":null,"abstract":"Graph Convolutional Networks (GCNs) have become state-of-the-art deep learning models for representation learning on graphs. Hardware acceleration of GCN inference is challenging due to: 1) massive size of the input graph, 2) heterogeneous workload of the GCN inference that consists of sparse and dense matrix operations, and 3) irregular information propagation along the edges during the computation. To address the above challenges, we propose the algorithm-architecture co-optimization to accelerate large-scale GCN inference on FPGA. We first perform data partitioning to fit each partition in the limited on-chip memory. Then, we use a two-phase preprocessing algorithm consisting of sparsification and node reordering. The first phase (sparsification) eliminates edge connections of high-degree nodes by merging common neighbor nodes. The second phase (re-ordering) effectively groups adjacent nodes to improve on-chip data reuse. Incorporating the above algorithmic optimizations, we propose a generic FPGA architecture to pipeline the two major computational kernels in GCN: aggregation and transformation. The flexible data path and task scheduling strategy of our design support various GCN models and lead to high throughput inference. We evaluate our design on state-of-the-art FPGA platform using three large scale datasets: Flickr, Reddit, Yelp. Compared with the state-of-the-art multi-core and GPU baselines, our design improves the throughput by up to $30 times$ and $2 times$ respectively.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122844767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Cost DNN Hardware Accelerator for Wearable, High-Quality Cardiac Arrythmia Detection
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00042
Johnson Loh, J. Wen, T. Gemmeke
This work implements a digital signal processing (DSP) accelerator for ECG signal classification. Targeting integration into wearable devices for 24/7 monitoring, low energy consumption per classification is a key requirement, while maintaining high classification accuracy at the same time. Co-optimization at the algorithm and hardware levels led to an architecture consisting mostly of convolution operations in the processing pipeline. The realized discrete wavelet transform and convolutional neural network (CNN) are utilized for continuous time-sequence classification in a sliding-window approach, moving away from the sample/batch-based processing typical for CNNs. In contrast to previous hardware realizations in this domain, the proposed design was validated using the benchmark dataset from the demanding CinC challenge 2017. The architecture achieves a competitive 0.781 F1-score with only 5597 trainable parameters, reducing the computational complexity of state-of-the-art ECG DNN software solutions by three orders of magnitude. Synthesis in a 22-nm FDSOI CMOS technology yields 0.783 $\mu$J per classification, meeting the requirements for edge-device operation at high-end classification performance.
{"title":"Low-Cost DNN Hardware Accelerator for Wearable, High-Quality Cardiac Arrythmia Detection","authors":"Johnson Loh, J. Wen, T. Gemmeke","doi":"10.1109/ASAP49362.2020.00042","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00042","url":null,"abstract":"This work implements a digital signal processing (DSP) accelerator for ECG signal classification. Targeting the integration into wearable devices for 24/7 monitoring, low energy consumption per classification is a key requirement, while maintaining a high classification accuracy at the same time. Co-optimization on algorithm and hardware level led to an architecture consisting mostly of convolution operations in the processing pipeline. The realized discrete wavelet transform and convolutional neural network (CNN) is utilized for continuous time-sequence classification in a sliding-window approach moving away from sample/batch-based processing typical for CNNs. In contrast to previous hardware realizations in this domain, the proposed design was validated using the benchmark dataset from the demanding CinC challenge 2017. The architecture achieves a competitive 0.781 Fl-score with only 5597 trainable parameters reducing the computational complexity of state-of-the-art ECGDNN software solutions by three orders of magnitude. Synthesis in a 22-nm FDSOI-CMOS technology features 0.783 $mu$J per solution meeting requirements for edge device operation at high-end classification performance.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121977824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SLATE: Managing Heterogeneous Cloud Functions
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00032
Jessica Vandebon, J. Coutinho, W. Luk, E. Nurvitadhi, Mishali Naik
This paper presents SLATE, a fully-managed, heterogeneous Function-as-a-Service (FaaS) system for deploying serverless functions onto heterogeneous cloud infrastructures. We extend the traditional homogeneous FaaS execution model to support heterogeneous functions, automating and abstracting the runtime management of heterogeneous compute resources in order to improve cloud tenants' access to specialised accelerator resources such as FPGAs and GPUs. In particular, we focus on the mechanisms required for heterogeneous scaling of deployed function instances to guarantee latency objectives while minimising cost. We develop a simulator to validate and evaluate our approach, considering case-study functions in three application domains: machine learning, bio-informatics, and physics. We incorporate empirically derived performance models for each function implementation, targeting a hardware platform with a combined computational capacity of 24 FPGAs and 12 CPU cores. Compared to homogeneous CPU and homogeneous FPGA functions, simulation results show cost improvements for non-uniform task traffic of up to 8.7 times and 1.7 times respectively, while maintaining the specified latency objectives.
{"title":"SLATE: Managing Heterogeneous Cloud Functions","authors":"Jessica Vandebon, J. Coutinho, W. Luk, E. Nurvitadhi, Mishali Naik","doi":"10.1109/ASAP49362.2020.00032","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00032","url":null,"abstract":"This paper presents SLATE, a fully-managed, heterogeneous Function-as-a-Service (FaaS) system for deploying serverless functions onto heterogeneous cloud infrastructures. We extend the traditional homogeneous FaaS execution model to support heterogeneous functions, automating and abstracting runtime management of heterogeneous compute resources in order to improve cloud tenant accessibility to specialised, accelerator resources, such as FPGAs and GPUs. In particular, we focus on the mechanisms required for heterogeneous scaling of deployed function instances to guarantee latency objectives while minimising cost. We develop a simulator to validate and evaluate our approach, considering case-study functions in three application domains: machine learning, bio-informatics, and physics. We incorporate empirically derived performance models for each function implementation targeting a hardware platform with combined computational capacity of 24 FPGAs and 12 CPU cores. Compared to homogeneous CPU and homogeneous FPGA functions, simulation results achieve respectively a cost improvement for non-uniform task traffic of up to 8.7 times and 1.7 times, while maintaining specified latency objectives.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129355153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00018
Aman Arora, Zhigang Wei, L. John
Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA-based designs an attractive solution, but the generic building blocks available in current FPGAs (logic blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that specializes FPGAs for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) to the FPGA fabric. These matmuls are implemented as systolic arrays of MACs (Multiply-and-Accumulate units) and can be connected using programmable direct interconnect between neighboring matmuls to form larger systolic matrix multipliers. We explore various matmul sizes ($2\times 2\times 2$, $4\times 4\times 4$, $8\times 8\times 8$, $16\times 16\times 16$) and various strategies to place these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing $4\times 4\times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from the MLPerf benchmarks by up to $\sim 3.9\times$, compared to a Stratix-10-like FPGA with an equal number of MACs, the same MAC architecture, and a high DSP:LB ratio. Although the flexibility of the FPGA is reduced for non-ML applications, an FPGA with hard matrix multipliers is a faster and more area-efficient hardware accelerator for ML applications than current FPGAs.
{"title":"Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks","authors":"Aman Arora, Zhigang Wei, L. John","doi":"10.1109/ASAP49362.2020.00018","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00018","url":null,"abstract":"Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA based designs an attractive solution. But the generic building blocks available in current FPGAs (Logic Blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that makes FPGAs specialized for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays of MACs (Multiply-And-Accumulate) and can be connected using programmable direct interconnect between neighboring matmuls to make larger systolic matrix multipliers. We explore various matmul sizes ($2times 2times 2$, $4times 4times 4$, $8times 8times 8$, $16times 16times 16$) and various strategies to place these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing $4times 4times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from MLPerf benchmarks by up to $sim 3.9x$, compared to a Stratix-10 like FPGA with equal number of MACs, same MAC architecture and high DSP:LB ratio. Although the flexibility of the FPGA will reduce for non-ML applications, an FPGA with hard matrix multipliers is a faster, and more area efficient hardware accelerator for ML applications, compared to current FPGAs.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132967372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Grouped Convolutions on Edge Devices
Pub Date: 2020-06-17 | DOI: 10.1109/ASAP49362.2020.00039
Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, M. O’Boyle, A. Storkey
When deploying a deep neural network on constrained hardware, it is possible to replace the network’s standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving on the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by $3.4\times$, $8\times$ and $4\times$ on average, respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/
{"title":"Optimizing Grouped Convolutions on Edge Devices","authors":"Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, M. O’Boyle, A. Storkey","doi":"10.1109/ASAP49362.2020.00039","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00039","url":null,"abstract":"When deploying a deep neural network on con-strained hardware, it is possible to replace the network’s standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by $3.4times, 8times$ and $ 4times$ on average respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128275189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Design Methodology for Post-Moore’s Law Accelerators: The Case of a Photonic Neuromorphic Processor
Pub Date: 2020-06-15 | DOI: 10.1109/ASAP49362.2020.00028
A. Mehrabian, V. Sorger, T. El-Ghazawi
Over the past decade, alternative technologies have gained momentum as conventional digital electronics continues to approach its limitations due to the end of Moore’s Law and Dennard scaling. At the same time, we are facing new application challenges, such as those arising from the enormous increase in data. Attention has therefore shifted from homogeneous computing to specialized heterogeneous solutions. As an example, brain-inspired computing has re-emerged as a viable solution for many applications. Such new processors, however, have widened the abstraction gap between the device level and applications, making efficient abstractions that can provide vertical design-flow tools for these technologies critical. Photonics in general, and neuromorphic photonics in particular, are among the promising alternatives to electronics. While the arsenal of device-level tools for photonics and of high-level neural network platforms is rapidly expanding, there has been little work to bridge the gap between them. Here, we present a design methodology that mitigates this problem by extending high-level, hardware-agnostic neural network design tools with functional and performance models of photonic components. In this paper we detail this tool and methodology using design examples and associated results. We show that adopting this approach enables designers to efficiently navigate the design space and devise hardware-aware systems with alternative technologies.
{"title":"A Design Methodology for Post-Moore’s Law Accelerators: The Case of a Photonic Neuromorphic Processor","authors":"A. Mehrabian, V. Sorger, T. El-Ghazawi","doi":"10.1109/ASAP49362.2020.00028","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00028","url":null,"abstract":"Over the past decade alternative technologies have gained momentum as conventional digital electronics continue to approach their limitations, due to the end of Moore’s Law and Dennard Scaling. At the same time, we are facing new application challenges such as those due to the enormous increase in data. The attention, has therefore, shifted from homogeneous computing to specialized heterogeneous solutions. As an example, brain-inspired computing has re-emerged as a viable solution for many applications. Such new processors, however, have widened the abstraction gamut from device level to applications. Therefore, efficient abstractions that can provide vertical design-flow tools for such technologies became critical. Photonics in general, and neuromorphic photonics in particular, are among the promising alternatives to electronics. While the arsenal of device level toolbox for photonics, and high-level neural network platforms are rapidly expanding, there has not been much work to bridge this gap. Here, we present a design methodology to mitigate this problem by extending high-level hardware-agnostic neural network design tools with functional and performance models of photonic components. In this paper we detail this tool and methodology by using design examples and associated results. We show that adopting this approach enables designers to efficiently navigate the design space and devise hardware-aware systems with alternative technologies.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132214318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecture Support for FPGA Multi-tenancy in the Cloud
Pub Date: 2020-06-14 | DOI: 10.1109/ASAP49362.2020.00030
Joel Mandebi Mbongue, Alex Shuping, Pankaj Bhowmik, C. Bobda
Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and a low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex UltraScale+ demonstrated near-specification maximum frequency for on-chip data movement and high throughput for virtual-instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization (we achieved $6\times$ higher FPGA utilization in our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about $2\times$ higher maximum frequency than the state of the art and a bandwidth of 25.6 Gbps.
{"title":"Architecture Support for FPGA Multi-tenancy in the Cloud","authors":"Joel Mandebi Mbongue, Alex Shuping, Pankaj Bhowmik, C. Bobda","doi":"10.1109/ASAP49362.2020.00030","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00030","url":null,"abstract":"Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex Ultrascale + demonstrated near specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization (we achieved $6 times$ higher FPGA utilization with our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about $2 times$ higher maximum frequency than the state-of-the-art and a bandwidth of 25.6 Gbps.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115551152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A System for Generating Non-Uniform Random Variates using Graphene Field-Effect Transistors
Pub Date: 2020-04-28 | DOI: 10.1109/ASAP49362.2020.00026
N. Tye, James Timothy Meech, B. Bilgin, Phillip Stanley-Marbell
We introduce a new method for hardware non-uniform random number generation based on the transfer characteristics of graphene field-effect transistors (GFETs), which requires as few as two transistors and a resistor. We implement the method by fabricating multiple GFETs and experimentally validating that their transfer characteristics exhibit the nonlinearity on which our method depends. We use the characterisation data in simulations of a proposed architecture for generating samples from dynamically selectable non-uniform probability distributions. The method we present has the potential for Gb/s sample rates, is reconfigurable for arbitrary target distributions, and has a wide range of possible applications. Using a combination of experimental measurements of GFETs under a range of biasing conditions and simulation of the GFET-based non-uniform random variate generator, we demonstrate a speedup of Monte Carlo integration of up to $2\times$. This speedup assumes that the analog-to-digital converters reading the outputs from the circuit can produce samples in the same amount of time that it takes to perform memory accesses.
{"title":"A System for Generating Non-Uniform Random Variates using Graphene Field-Effect Transistors","authors":"N. Tye, James Timothy Meech, B. Bilgin, Phillip Stanley-Marbell","doi":"10.1109/ASAP49362.2020.00026","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00026","url":null,"abstract":"We introduce a new method for hardware nonuniform random number generation based on the transfer characteristics of graphene field-effect transistors (GFETs) which requires as few as two transistors and a resistor. We implement the method by fabricating multiple GFETs and experimentally validating that their transfer characteristics exhibit the nonlinearity on which our method depends. We use characterisation data in simulations of a proposed architecture for generating samples from dynamically selectable non-uniform probability distributions. The method we present has the potential for Gb/s sample rates, is reconfigurable for arbitrary target distributions, and has a wide range of possible applications. Using a combination of experimental measurements of GFETs under a range of biasing conditions and simulation of the GFET-based non-uniform random variate generator, we demonstrate a speedup of Monte Carlo integration by up to $2 times$. This speedup assumes the analog-to-digital converters reading the outputs from the circuit can produce samples in the same amount of time that it takes to perform memory accesses.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131154364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}