Hardware Acceleration of Large Scale GCN Inference
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00019
Bingyi Zhang, Hanqing Zeng, V. Prasanna
Graph Convolutional Networks (GCNs) have become state-of-the-art deep learning models for representation learning on graphs. Hardware acceleration of GCN inference is challenging due to: 1) the massive size of the input graph, 2) the heterogeneous workload of GCN inference, which consists of sparse and dense matrix operations, and 3) the irregular information propagation along the edges during the computation. To address these challenges, we propose an algorithm-architecture co-optimization to accelerate large-scale GCN inference on FPGA. We first perform data partitioning so that each partition fits in the limited on-chip memory. Then, we use a two-phase preprocessing algorithm consisting of sparsification and node reordering. The first phase (sparsification) eliminates edge connections of high-degree nodes by merging common neighbor nodes. The second phase (reordering) effectively groups adjacent nodes to improve on-chip data reuse. Incorporating the above algorithmic optimizations, we propose a generic FPGA architecture that pipelines the two major computational kernels in GCN: aggregation and transformation. The flexible data path and task-scheduling strategy of our design support various GCN models and lead to high-throughput inference. We evaluate our design on a state-of-the-art FPGA platform using three large-scale datasets: Flickr, Reddit, and Yelp. Compared with state-of-the-art multi-core and GPU baselines, our design improves throughput by up to $30\times$ and $2\times$, respectively.
{"title":"Hardware Acceleration of Large Scale GCN Inference","authors":"Bingyi Zhang, Hanqing Zeng, V. Prasanna","doi":"10.1109/ASAP49362.2020.00019","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00019","url":null,"abstract":"Graph Convolutional Networks (GCNs) have become state-of-the-art deep learning models for representation learning on graphs. Hardware acceleration of GCN inference is challenging due to: 1) massive size of the input graph, 2) heterogeneous workload of the GCN inference that consists of sparse and dense matrix operations, and 3) irregular information propagation along the edges during the computation. To address the above challenges, we propose the algorithm-architecture co-optimization to accelerate large-scale GCN inference on FPGA. We first perform data partitioning to fit each partition in the limited on-chip memory. Then, we use a two-phase preprocessing algorithm consisting of sparsification and node reordering. The first phase (sparsification) eliminates edge connections of high-degree nodes by merging common neighbor nodes. The second phase (re-ordering) effectively groups adjacent nodes to improve on-chip data reuse. Incorporating the above algorithmic optimizations, we propose a generic FPGA architecture to pipeline the two major computational kernels in GCN: aggregation and transformation. The flexible data path and task scheduling strategy of our design support various GCN models and lead to high throughput inference. We evaluate our design on state-of-the-art FPGA platform using three large scale datasets: Flickr, Reddit, Yelp. Compared with the state-of-the-art multi-core and GPU baselines, our design improves the throughput by up to $30 times$ and $2 times$ respectively.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122844767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-Cost DNN Hardware Accelerator for Wearable, High-Quality Cardiac Arrythmia Detection
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00042
Johnson Loh, J. Wen, T. Gemmeke
This work implements a digital signal processing (DSP) accelerator for ECG signal classification. Targeting integration into wearable devices for 24/7 monitoring, low energy consumption per classification is a key requirement, while maintaining high classification accuracy at the same time. Co-optimization at the algorithm and hardware levels led to an architecture consisting mostly of convolution operations in the processing pipeline. The realized discrete wavelet transform and convolutional neural network (CNN) are utilized for continuous time-sequence classification in a sliding-window approach, moving away from the sample/batch-based processing typical for CNNs. In contrast to previous hardware realizations in this domain, the proposed design was validated using the benchmark dataset from the demanding CinC challenge 2017. The architecture achieves a competitive 0.781 F1-score with only 5597 trainable parameters, reducing the computational complexity of state-of-the-art ECG DNN software solutions by three orders of magnitude. Synthesis in a 22-nm FDSOI CMOS technology yields 0.783 $\mu$J per classification, meeting the requirements for edge-device operation at high-end classification performance.
{"title":"Low-Cost DNN Hardware Accelerator for Wearable, High-Quality Cardiac Arrythmia Detection","authors":"Johnson Loh, J. Wen, T. Gemmeke","doi":"10.1109/ASAP49362.2020.00042","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00042","url":null,"abstract":"This work implements a digital signal processing (DSP) accelerator for ECG signal classification. Targeting the integration into wearable devices for 24/7 monitoring, low energy consumption per classification is a key requirement, while maintaining a high classification accuracy at the same time. Co-optimization on algorithm and hardware level led to an architecture consisting mostly of convolution operations in the processing pipeline. The realized discrete wavelet transform and convolutional neural network (CNN) is utilized for continuous time-sequence classification in a sliding-window approach moving away from sample/batch-based processing typical for CNNs. In contrast to previous hardware realizations in this domain, the proposed design was validated using the benchmark dataset from the demanding CinC challenge 2017. The architecture achieves a competitive 0.781 Fl-score with only 5597 trainable parameters reducing the computational complexity of state-of-the-art ECGDNN software solutions by three orders of magnitude. Synthesis in a 22-nm FDSOI-CMOS technology features 0.783 $mu$J per solution meeting requirements for edge device operation at high-end classification performance.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121977824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SLATE: Managing Heterogeneous Cloud Functions
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00032
Jessica Vandebon, J. Coutinho, W. Luk, E. Nurvitadhi, Mishali Naik
This paper presents SLATE, a fully-managed, heterogeneous Function-as-a-Service (FaaS) system for deploying serverless functions onto heterogeneous cloud infrastructures. We extend the traditional homogeneous FaaS execution model to support heterogeneous functions, automating and abstracting the runtime management of heterogeneous compute resources in order to improve cloud tenants' access to specialised accelerator resources such as FPGAs and GPUs. In particular, we focus on the mechanisms required for heterogeneous scaling of deployed function instances to guarantee latency objectives while minimising cost. We develop a simulator to validate and evaluate our approach, considering case-study functions in three application domains: machine learning, bio-informatics, and physics. We incorporate empirically derived performance models for each function implementation, targeting a hardware platform with a combined computational capacity of 24 FPGAs and 12 CPU cores. Compared to homogeneous CPU and homogeneous FPGA functions, simulation results show cost improvements for non-uniform task traffic of up to 8.7 times and 1.7 times respectively, while maintaining the specified latency objectives.
{"title":"SLATE: Managing Heterogeneous Cloud Functions","authors":"Jessica Vandebon, J. Coutinho, W. Luk, E. Nurvitadhi, Mishali Naik","doi":"10.1109/ASAP49362.2020.00032","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00032","url":null,"abstract":"This paper presents SLATE, a fully-managed, heterogeneous Function-as-a-Service (FaaS) system for deploying serverless functions onto heterogeneous cloud infrastructures. We extend the traditional homogeneous FaaS execution model to support heterogeneous functions, automating and abstracting runtime management of heterogeneous compute resources in order to improve cloud tenant accessibility to specialised, accelerator resources, such as FPGAs and GPUs. In particular, we focus on the mechanisms required for heterogeneous scaling of deployed function instances to guarantee latency objectives while minimising cost. We develop a simulator to validate and evaluate our approach, considering case-study functions in three application domains: machine learning, bio-informatics, and physics. We incorporate empirically derived performance models for each function implementation targeting a hardware platform with combined computational capacity of 24 FPGAs and 12 CPU cores. Compared to homogeneous CPU and homogeneous FPGA functions, simulation results achieve respectively a cost improvement for non-uniform task traffic of up to 8.7 times and 1.7 times, while maintaining specified latency objectives.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129355153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00018
Aman Arora, Zhigang Wei, L. John
Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA-based designs an attractive solution, but the generic building blocks available in current FPGAs (logic blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that specializes FPGAs for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) to the FPGA fabric. These matmuls are implemented as systolic arrays of MACs (Multiply-and-Accumulate units) and can be connected using programmable direct interconnect between neighboring matmuls to form larger systolic matrix multipliers. We explore various matmul sizes ($2\times 2\times 2$, $4\times 4\times 4$, $8\times 8\times 8$, $16\times 16\times 16$) and various strategies to place these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing $4\times 4\times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from the MLPerf benchmarks by up to $\sim 3.9\times$, compared to a Stratix-10-like FPGA with an equal number of MACs, the same MAC architecture, and a high DSP:LB ratio. Although the flexibility of the FPGA is reduced for non-ML applications, an FPGA with hard matrix multipliers is a faster and more area-efficient hardware accelerator for ML applications than current FPGAs.
{"title":"Hamamu: Specializing FPGAs for ML Applications by Adding Hard Matrix Multiplier Blocks","authors":"Aman Arora, Zhigang Wei, L. John","doi":"10.1109/ASAP49362.2020.00018","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00018","url":null,"abstract":"Designing efficient hardware for accelerating artificial intelligence (AI) and machine learning (ML) applications is a major challenge. Rapidly changing algorithms and neural network architectures make FPGA based designs an attractive solution. But the generic building blocks available in current FPGAs (Logic Blocks (LBs), multipliers, DSP blocks) limit the acceleration that can be achieved. We propose Hamamu, a modification to the current FPGA architecture that makes FPGAs specialized for ML applications. Specifically, we propose adding hard matrix multiplier blocks (matmuls) into the FPGA fabric. These matmuls are implemented using systolic arrays of MACs (Multiply-And-Accumulate) and can be connected using programmable direct interconnect between neighboring matmuls to make larger systolic matrix multipliers. We explore various matmul sizes ($2times 2times 2$, $4times 4times 4$, $8times 8times 8$, $16times 16times 16$) and various strategies to place these blocks on the FPGA (Columnar, Surround, Hybrid). We find that providing $4times 4times 4$ hard matrix multiplier blocks in an FPGA speeds up neural networks from MLPerf benchmarks by up to $sim 3.9x$, compared to a Stratix-10 like FPGA with equal number of MACs, same MAC architecture and high DSP:LB ratio. Although the flexibility of the FPGA will reduce for non-ML applications, an FPGA with hard matrix multipliers is a faster, and more area efficient hardware accelerator for ML applications, compared to current FPGAs.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132967372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Grouped Convolutions on Edge Devices
Pub Date: 2020-06-17 | DOI: 10.1109/ASAP49362.2020.00039
Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, M. O’Boyle, A. Storkey
When deploying a deep neural network on constrained hardware, it is possible to replace the network’s standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving on the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by $3.4\times$, $8\times$ and $4\times$ on average, respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/
{"title":"Optimizing Grouped Convolutions on Edge Devices","authors":"Perry Gibson, José Cano, Jack Turner, Elliot J. Crowley, M. O’Boyle, A. Storkey","doi":"10.1109/ASAP49362.2020.00039","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00039","url":null,"abstract":"When deploying a deep neural network on con-strained hardware, it is possible to replace the network’s standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by $3.4times, 8times$ and $ 4times$ on average respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128275189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Design Methodology for Post-Moore’s Law Accelerators: The Case of a Photonic Neuromorphic Processor
Pub Date: 2020-06-15 | DOI: 10.1109/ASAP49362.2020.00028
A. Mehrabian, V. Sorger, T. El-Ghazawi
Over the past decade, alternative technologies have gained momentum as conventional digital electronics continues to approach its limitations due to the end of Moore’s Law and Dennard scaling. At the same time, we are facing new application challenges, such as those arising from the enormous increase in data. Attention has therefore shifted from homogeneous computing to specialized heterogeneous solutions. As an example, brain-inspired computing has re-emerged as a viable solution for many applications. Such new processors, however, have widened the abstraction gap between the device level and applications, making efficient abstractions that can provide vertical design-flow tools for these technologies critical. Photonics in general, and neuromorphic photonics in particular, are among the promising alternatives to electronics. While the arsenal of device-level tools for photonics and of high-level neural network platforms is rapidly expanding, there has been little work to bridge the gap between them. Here, we present a design methodology that mitigates this problem by extending high-level, hardware-agnostic neural network design tools with functional and performance models of photonic components. In this paper we detail this tool and methodology using design examples and associated results. We show that adopting this approach enables designers to efficiently navigate the design space and devise hardware-aware systems with alternative technologies.
{"title":"A Design Methodology for Post-Moore’s Law Accelerators: The Case of a Photonic Neuromorphic Processor","authors":"A. Mehrabian, V. Sorger, T. El-Ghazawi","doi":"10.1109/ASAP49362.2020.00028","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00028","url":null,"abstract":"Over the past decade alternative technologies have gained momentum as conventional digital electronics continue to approach their limitations, due to the end of Moore’s Law and Dennard Scaling. At the same time, we are facing new application challenges such as those due to the enormous increase in data. The attention, has therefore, shifted from homogeneous computing to specialized heterogeneous solutions. As an example, brain-inspired computing has re-emerged as a viable solution for many applications. Such new processors, however, have widened the abstraction gamut from device level to applications. Therefore, efficient abstractions that can provide vertical design-flow tools for such technologies became critical. Photonics in general, and neuromorphic photonics in particular, are among the promising alternatives to electronics. While the arsenal of device level toolbox for photonics, and high-level neural network platforms are rapidly expanding, there has not been much work to bridge this gap. Here, we present a design methodology to mitigate this problem by extending high-level hardware-agnostic neural network design tools with functional and performance models of photonic components. In this paper we detail this tool and methodology by using design examples and associated results. We show that adopting this approach enables designers to efficiently navigate the design space and devise hardware-aware systems with alternative technologies.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132214318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecture Support for FPGA Multi-tenancy in the Cloud
Pub Date: 2020-06-14 | DOI: 10.1109/ASAP49362.2020.00030
Joel Mandebi Mbongue, Alex Shuping, Pankaj Bhowmik, C. Bobda
Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and a low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex UltraScale+ demonstrated near-specification maximum frequency for on-chip data movement and high throughput for virtual-instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization (we achieved $6\times$ higher FPGA utilization in our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about $2\times$ higher maximum frequency than the state of the art and a bandwidth of 25.6 Gbps.
{"title":"Architecture Support for FPGA Multi-tenancy in the Cloud","authors":"Joel Mandebi Mbongue, Alex Shuping, Pankaj Bhowmik, C. Bobda","doi":"10.1109/ASAP49362.2020.00030","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00030","url":null,"abstract":"Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud. The proposed architecture implements a network-on-chip (NoC) designed for fast data movement and low hardware footprint. Prototyping the proposed architecture on a Xilinx Virtex Ultrascale + demonstrated near specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization (we achieved $6 times$ higher FPGA utilization with our case study), which is one of the major goals of virtualization. Overall, our NoC interconnect achieved about $2 times$ higher maximum frequency than the state-of-the-art and a bandwidth of 25.6 Gbps.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115551152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A System for Generating Non-Uniform Random Variates using Graphene Field-Effect Transistors
Pub Date: 2020-04-28 | DOI: 10.1109/ASAP49362.2020.00026
N. Tye, James Timothy Meech, B. Bilgin, Phillip Stanley-Marbell
We introduce a new method for hardware non-uniform random number generation based on the transfer characteristics of graphene field-effect transistors (GFETs), which requires as few as two transistors and a resistor. We implement the method by fabricating multiple GFETs and experimentally validating that their transfer characteristics exhibit the nonlinearity on which our method depends. We use the characterisation data in simulations of a proposed architecture for generating samples from dynamically selectable non-uniform probability distributions. The method we present has the potential for Gb/s sample rates, is reconfigurable for arbitrary target distributions, and has a wide range of possible applications. Using a combination of experimental measurements of GFETs under a range of biasing conditions and simulation of the GFET-based non-uniform random variate generator, we demonstrate a speedup of Monte Carlo integration of up to $2\times$. This speedup assumes that the analog-to-digital converters reading the outputs from the circuit can produce samples in the same amount of time that it takes to perform memory accesses.
{"title":"A System for Generating Non-Uniform Random Variates using Graphene Field-Effect Transistors","authors":"N. Tye, James Timothy Meech, B. Bilgin, Phillip Stanley-Marbell","doi":"10.1109/ASAP49362.2020.00026","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00026","url":null,"abstract":"We introduce a new method for hardware nonuniform random number generation based on the transfer characteristics of graphene field-effect transistors (GFETs) which requires as few as two transistors and a resistor. We implement the method by fabricating multiple GFETs and experimentally validating that their transfer characteristics exhibit the nonlinearity on which our method depends. We use characterisation data in simulations of a proposed architecture for generating samples from dynamically selectable non-uniform probability distributions. The method we present has the potential for Gb/s sample rates, is reconfigurable for arbitrary target distributions, and has a wide range of possible applications. Using a combination of experimental measurements of GFETs under a range of biasing conditions and simulation of the GFET-based non-uniform random variate generator, we demonstrate a speedup of Monte Carlo integration by up to $2 times$. This speedup assumes the analog-to-digital converters reading the outputs from the circuit can produce samples in the same amount of time that it takes to perform memory accesses.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131154364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}