Walter Lau Neto, L. Amarù, V. Possani, P. Vuillod, Jiong Luo, A. Mishchenko, P. Gaillardon
LUT-based optimization techniques are finding new applications in synthesis of ASIC designs. Intuitively, packing logic into LUTs provides a better balance between functionality and structure in logic optimization. On this basis, the LUT-engine framework [1] was introduced to enhance the ASIC synthesis. In this paper, we present key improvements, at both algorithmic and flow levels, making a much stronger LUT-engine. We restructure the flow of LUT-engine, to benefit from a heterogeneous mixture of LUT sizes, and revisit its requirements for maximum scalability. We propose a dedicated LUT mapper for the new flow, based on FlowMap, natively balancing LUT-count and NAND2-count for a wide range LUT sizes. We describe a specialized Boolean factoring technique, exploiting the fanin bounds in LUT networks, resulting in a very fast LUT-based AIG minimization. By using the proposed methodology, we improve 9 of the best area results in the ongoing EPFL synthesis competition. Integrated in a complete EDA flow for ASICs, the new LUT-engine performs well on a set of 87 benchmarks: -4.60% area and -3.41% switching power at +5% runtime, compared to the baseline flow without LUT-based optimizations, and -3.02% area and -2.54% switching power with -1% runtime, compared to the original LUT-engine.
{"title":"Improving LUT-based optimization for ASICs","authors":"Walter Lau Neto, L. Amarù, V. Possani, P. Vuillod, Jiong Luo, A. Mishchenko, P. Gaillardon","doi":"10.1145/3489517.3530461","DOIUrl":"https://doi.org/10.1145/3489517.3530461","url":null,"abstract":"LUT-based optimization techniques are finding new applications in synthesis of ASIC designs. Intuitively, packing logic into LUTs provides a better balance between functionality and structure in logic optimization. On this basis, the LUT-engine framework [1] was introduced to enhance the ASIC synthesis. In this paper, we present key improvements, at both algorithmic and flow levels, making a much stronger LUT-engine. We restructure the flow of LUT-engine, to benefit from a heterogeneous mixture of LUT sizes, and revisit its requirements for maximum scalability. We propose a dedicated LUT mapper for the new flow, based on FlowMap, natively balancing LUT-count and NAND2-count for a wide range LUT sizes. We describe a specialized Boolean factoring technique, exploiting the fanin bounds in LUT networks, resulting in a very fast LUT-based AIG minimization. By using the proposed methodology, we improve 9 of the best area results in the ongoing EPFL synthesis competition. Integrated in a complete EDA flow for ASICs, the new LUT-engine performs well on a set of 87 benchmarks: -4.60% area and -3.41% switching power at +5% runtime, compared to the baseline flow without LUT-based optimizations, and -3.02% area and -2.54% switching power with -1% runtime, compared to the original LUT-engine.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121470997","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shubham Negi, I. Chakraborty, Aayush Ankit, K. Roy
Neural Architecture Search (NAS) has provided the ability to design efficient deep neural network (DNN) catered towards different hardwares like GPUs, CPUs etc. However, integrating NAS with Memristive Crossbar Array (MCA) based In-Memory Computing (IMC) accelerator remains an open problem. The hardware efficiency (energy, latency and area) as well as application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters such as kernel size, depth etc. and hardware architecture parameters such as crossbar size and the precision of analog-to-digital converters. Co-optimization of both network and hardware parameters presents a challenging search space comprising of different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX - an efficient neural architecture search engine that co-designs neural network and IMC based hardware architecture. NAX explores the aforementioned search space to determine kernel and corresponding crossbar sizes for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. For CIFAR-10 and Tiny ImageNet, our models achieve 0.9% and 18.57% higher accuracy at 30% and -10.47% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.
{"title":"NAX","authors":"Shubham Negi, I. Chakraborty, Aayush Ankit, K. Roy","doi":"10.1145/3489517.3530476","DOIUrl":"https://doi.org/10.1145/3489517.3530476","url":null,"abstract":"Neural Architecture Search (NAS) has provided the ability to design efficient deep neural network (DNN) catered towards different hardwares like GPUs, CPUs etc. However, integrating NAS with Memristive Crossbar Array (MCA) based In-Memory Computing (IMC) accelerator remains an open problem. The hardware efficiency (energy, latency and area) as well as application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters such as kernel size, depth etc. and hardware architecture parameters such as crossbar size and the precision of analog-to-digital converters. Co-optimization of both network and hardware parameters presents a challenging search space comprising of different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX - an efficient neural architecture search engine that co-designs neural network and IMC based hardware architecture. NAX explores the aforementioned search space to determine kernel and corresponding crossbar sizes for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. For CIFAR-10 and Tiny ImageNet, our models achieve 0.9% and 18.57% higher accuracy at 30% and -10.47% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121506923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to rapidly growing design complexity, timing macro modeling has been widely adopted to enable hierarchical and parallel timing analysis. The main challenge of timing macro modeling is to identify timing variant pins for achieving high timing accuracy while keeping a compact model size. To tackle this challenge, prior work applied ad-hoc techniques and threshold setting. In this work, we present a novel timing macro modeling approach based on graph neural networks (GNNs). A timing sensitivity metric is proposed to precisely evaluate the influence of each pin on the timing accuracy. Based on the timing sensitivity data and the circuit topology, the GNN model can effectively learn and capture timing variant pins. Experimental results show that our GNN-based framework reduces 10% model sizes while preserving the same timing accuracy as the state-of-the-art. Furthermore, taking common path pessimism removal (CPPR) as an example, the generality and applicability of our framework on various timing analysis models and modes are also validated empirically.
{"title":"Timing macro modeling with graph neural networks","authors":"K. Chang, Chun-Yao Chiang, Pei-Yu Lee, I. Jiang","doi":"10.1145/3489517.3530599","DOIUrl":"https://doi.org/10.1145/3489517.3530599","url":null,"abstract":"Due to rapidly growing design complexity, timing macro modeling has been widely adopted to enable hierarchical and parallel timing analysis. The main challenge of timing macro modeling is to identify timing variant pins for achieving high timing accuracy while keeping a compact model size. To tackle this challenge, prior work applied ad-hoc techniques and threshold setting. In this work, we present a novel timing macro modeling approach based on graph neural networks (GNNs). A timing sensitivity metric is proposed to precisely evaluate the influence of each pin on the timing accuracy. Based on the timing sensitivity data and the circuit topology, the GNN model can effectively learn and capture timing variant pins. Experimental results show that our GNN-based framework reduces 10% model sizes while preserving the same timing accuracy as the state-of-the-art. Furthermore, taking common path pessimism removal (CPPR) as an example, the generality and applicability of our framework on various timing analysis models and modes are also validated empirically.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"39 10","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114021996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optane Persistent Memory (PM) is a pioneering solution to byte-addressable PM for commodity systems. However, the performance of Optane PM is highly workload-sensitive, rendering many prior designs of Key-Value (KV) store inefficient. To cope with this reality, we advocate rethinking KV store design for Optane PM. Our design follows a principle of Single-stream Writing with managed Multi-stream Reading (SWMR): Incoming KV pairs are written to PM through a single write stream and managed by an ordered index in DRAM. Through asynchronously sorting and rewriting large sets of KV pairs, range queries are handled with a managed number of concurrent streams. YCSB results show that our design improved upon existing ones by 116% and 21% for write-only throughput and read-write throughput, respectively.
{"title":"Rethinking key-value store for byte-addressable optane persistent memory","authors":"Sung-Ming Wu, Li-Pin Chang","doi":"10.1145/3489517.3530535","DOIUrl":"https://doi.org/10.1145/3489517.3530535","url":null,"abstract":"Optane Persistent Memory (PM) is a pioneering solution to byte-addressable PM for commodity systems. However, the performance of Optane PM is highly workload-sensitive, rendering many prior designs of Key-Value (KV) store inefficient. To cope with this reality, we advocate rethinking KV store design for Optane PM. Our design follows a principle of Single-stream Writing with managed Multi-stream Reading (SWMR): Incoming KV pairs are written to PM through a single write stream and managed by an ordered index in DRAM. Through asynchronously sorting and rewriting large sets of KV pairs, range queries are handled with a managed number of concurrent streams. YCSB results show that our design improved upon existing ones by 116% and 21% for write-only throughput and read-write throughput, respectively.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125699166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A visionary computing paradigm is to train resource efficient neural networks on the edge using dedicated low-power accelerators instead of cloud infrastructures, eliminating communication overheads and privacy concerns. One promising resource-efficient approach for inference is binarized neural networks (BNNs), which binarize parameters and activations. However, training BNNs remains resource demanding. State-of-the-art BNN training methods, such as the binary optimizer (Bop), require to store and update a large number of momentum values in the floating point (FP) format. In this work, we focus on memory-efficient FP encodings for the momentum values in Bop. To achieve this, we first investigate the impact of arbitrary FP encodings. When the FP format is not properly chosen, we prove that the updates of the momentum values can be lost and the quality of training is therefore dropped. With the insights, we formulate a metric to determine the number of unchanged momentum values in a training iteration due to the FP encoding. Based on the metric, we develop an algorithm to find FP encodings that are more memory-efficient than the standard FP encodings. In our experiments, the memory usage in BNN training is decreased by factors 2.47x, 2.43x, 2.04x, depending on the BNN model, with minimal accuracy cost (smaller than 1%) compared to using 32-bit FP encoding.
{"title":"Memory-efficient training of binarized neural networks on the edge","authors":"Mikail Yayla, Jian-Jia Chen","doi":"10.1145/3489517.3530496","DOIUrl":"https://doi.org/10.1145/3489517.3530496","url":null,"abstract":"A visionary computing paradigm is to train resource efficient neural networks on the edge using dedicated low-power accelerators instead of cloud infrastructures, eliminating communication overheads and privacy concerns. One promising resource-efficient approach for inference is binarized neural networks (BNNs), which binarize parameters and activations. However, training BNNs remains resource demanding. State-of-the-art BNN training methods, such as the binary optimizer (Bop), require to store and update a large number of momentum values in the floating point (FP) format. In this work, we focus on memory-efficient FP encodings for the momentum values in Bop. To achieve this, we first investigate the impact of arbitrary FP encodings. When the FP format is not properly chosen, we prove that the updates of the momentum values can be lost and the quality of training is therefore dropped. With the insights, we formulate a metric to determine the number of unchanged momentum values in a training iteration due to the FP encoding. Based on the metric, we develop an algorithm to find FP encodings that are more memory-efficient than the standard FP encodings. In our experiments, the memory usage in BNN training is decreased by factors 2.47x, 2.43x, 2.04x, depending on the BNN model, with minimal accuracy cost (smaller than 1%) compared to using 32-bit FP encoding.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126133636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As state-of-the-art quantum computers are capable of running increasingly complex algorithms, the need for automated methods to design and test potential applications rises. Equivalence checking of quantum circuits is an important, yet hardly automated, task in the development of the quantum software stack. Recently, new methods have been proposed that tackle this problem from widely different perspectives. However, there is no established baseline on which to judge current and future progress in equivalence checking of quantum circuits. In order to close this gap, we conduct a detailed case study of two of the most promising equivalence checking methodologies---one based on decision diagrams and one based on the ZX-calculus---and compare their strengths and weaknesses.
{"title":"Equivalence checking paradigms in quantum circuit design: a case study","authors":"Tom Peham, Lukas Burgholzer, R. Wille","doi":"10.1145/3489517.3530480","DOIUrl":"https://doi.org/10.1145/3489517.3530480","url":null,"abstract":"As state-of-the-art quantum computers are capable of running increasingly complex algorithms, the need for automated methods to design and test potential applications rises. Equivalence checking of quantum circuits is an important, yet hardly automated, task in the development of the quantum software stack. Recently, new methods have been proposed that tackle this problem from widely different perspectives. However, there is no established baseline on which to judge current and future progress in equivalence checking of quantum circuits. In order to close this gap, we conduct a detailed case study of two of the most promising equivalence checking methodologies---one based on decision diagrams and one based on the ZX-calculus---and compare their strengths and weaknesses.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131916519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Annealing-based Ising machines have shown promising results in solving combinatorial optimization problems. As a typical class of these problems, however, traveling salesman problems (TSPs) are very challenging to solve due to the constraints imposed on the solution. This article proposes a parallel annealing algorithm for a fully connected Ising machine that significantly improves the accuracy and performance in solving constrained combinatorial optimization problems such as the TSP. Unlike previous parallel annealing algorithms, this improved parallel annealing (IPA) algorithm efficiently solves TSPs using an exponential temperature function with a dynamic offset. Compared with digital annealing (DA) and momentum annealing (MA), the IPA reduces the run time by 44.4 times and 19.9 times for a 14-city TSP, respectively. Large scale TSPs can be more efficiently solved by taking a k-medoids clustering approach that decreases the average travel distance of a 22-city TSP by 51.8% compared with DA and by 42.0% compared with MA. This approach groups neighboring cities into clusters to form a reduced TSP, which is then solved in a hierarchical manner by using the IPA algorithm.
{"title":"Solving traveling salesman problems via a parallel fully connected ising machine","authors":"Qichao Tao, Jie Han","doi":"10.1145/3489517.3530595","DOIUrl":"https://doi.org/10.1145/3489517.3530595","url":null,"abstract":"Annealing-based Ising machines have shown promising results in solving combinatorial optimization problems. As a typical class of these problems, however, traveling salesman problems (TSPs) are very challenging to solve due to the constraints imposed on the solution. This article proposes a parallel annealing algorithm for a fully connected Ising machine that significantly improves the accuracy and performance in solving constrained combinatorial optimization problems such as the TSP. Unlike previous parallel annealing algorithms, this improved parallel annealing (IPA) algorithm efficiently solves TSPs using an exponential temperature function with a dynamic offset. Compared with digital annealing (DA) and momentum annealing (MA), the IPA reduces the run time by 44.4 times and 19.9 times for a 14-city TSP, respectively. Large scale TSPs can be more efficiently solved by taking a k-medoids clustering approach that decreases the average travel distance of a 22-city TSP by 51.8% compared with DA and by 42.0% compared with MA. This approach groups neighboring cities into clusters to form a reduced TSP, which is then solved in a hierarchical manner by using the IPA algorithm.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129984976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Subhash Sethumurugan, Shashank Hegde, Hari Cherupalli, J. Sartori
Recent work has demonstrated the effectiveness of using symbolic simulation to perform hardware software co-analysis on an application-processor pair and developed a variety of hardware and software design techniques and optimizations, ranging from providing system security guarantees to automated generation of application-specific bespoke processors. Despite their potential benefits, current state-of-the-art symbolic simulation tools for hardware-software co-analysis are restricted in their applicability, since prior work relies on a costly process of building a custom simulation tool for each processor design to be simulated. Furthermore, prior work does not describe how to extend the symbolic analysis technique to other processor designs. In an effort to generalize the technique for any processor design, we propose a custom symbolic simulator that uses iverilog to perform symbolic behavioral simulation. With iverilog - an open source synthesis and simulation tool - we implement a design-agnostic symbolic simulation tool for hardware-software co-analysis. To demonstrate the generality of our tool, we apply symbolic analysis to three embedded processors with different ISAs: bm32 (a MIPS-based processor), darkRiscV (a RISC-V-based processor), and openMSP430 (based on MSP430). We use analysis results to generate bespoke processors for each design and observe gate count reductions of 27%, 16%, and 56% on these processors, respectively. Our results demonstrate the versatility of our simulation tool and the uniqueness of each design with respect to symbolic analysis and the bespoke methodology.
{"title":"A scalable symbolic simulation tool for low power embedded systems","authors":"Subhash Sethumurugan, Shashank Hegde, Hari Cherupalli, J. Sartori","doi":"10.1145/3489517.3530433","DOIUrl":"https://doi.org/10.1145/3489517.3530433","url":null,"abstract":"Recent work has demonstrated the effectiveness of using symbolic simulation to perform hardware software co-analysis on an application-processor pair and developed a variety of hardware and software design techniques and optimizations, ranging from providing system security guarantees to automated generation of application-specific bespoke processors. Despite their potential benefits, current state-of-the-art symbolic simulation tools for hardware-software co-analysis are restricted in their applicability, since prior work relies on a costly process of building a custom simulation tool for each processor design to be simulated. Furthermore, prior work does not describe how to extend the symbolic analysis technique to other processor designs. In an effort to generalize the technique for any processor design, we propose a custom symbolic simulator that uses iverilog to perform symbolic behavioral simulation. With iverilog - an open source synthesis and simulation tool - we implement a design-agnostic symbolic simulation tool for hardware-software co-analysis. To demonstrate the generality of our tool, we apply symbolic analysis to three embedded processors with different ISAs: bm32 (a MIPS-based processor), darkRiscV (a RISC-V-based processor), and openMSP430 (based on MSP430). We use analysis results to generate bespoke processors for each design and observe gate count reductions of 27%, 16%, and 56% on these processors, respectively. Our results demonstrate the versatility of our simulation tool and the uniqueness of each design with respect to symbolic analysis and the bespoke methodology.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130022175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pei Cao, Hongyi Zhang, Dawu Gu, Yan Lu, Yidong Yuan
In this paper, we focus on the portability issue in profiled side-channel attacks (SCAs) that arises due to significant device-to-device variations. Device discrepancy is inevitable in realistic attacks, but it is often neglected in research works. In this paper, we identify such device variations and take a further step towards leveraging the transferability of neural networks. We propose a novel adversarial learning-based profiled attack (AL-PA), which enables our neural network to learn device-invariant features. We evaluated our strategy on eight XMEGA microcontrollers. Without the need for target-specific preprocessing and multiple profiling devices, our approach has outperformed the state-of-the-art methods.
{"title":"AL-PA: cross-device profiled side-channel attack using adversarial learning","authors":"Pei Cao, Hongyi Zhang, Dawu Gu, Yan Lu, Yidong Yuan","doi":"10.1145/3489517.3530517","DOIUrl":"https://doi.org/10.1145/3489517.3530517","url":null,"abstract":"In this paper, we focus on the portability issue in profiled side-channel attacks (SCAs) that arises due to significant device-to-device variations. Device discrepancy is inevitable in realistic attacks, but it is often neglected in research works. In this paper, we identify such device variations and take a further step towards leveraging the transferability of neural networks. We propose a novel adversarial learning-based profiled attack (AL-PA), which enables our neural network to learn device-invariant features. We evaluated our strategy on eight XMEGA microcontrollers. Without the need for target-specific preprocessing and multiple profiling devices, our approach has outperformed the state-of-the-art methods.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131506650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adiabatic quantum-flux parametron (AQFP) is an energy-efficient superconducting technology. Buffer and splitter (B/S) cells must be inserted to an AQFP circuit to meet the technology-imposed constraints on path balancing and fanout branching. These cells account for a significant amount of the circuit's area and delay. In this paper, we identify that B/S insertion is a scheduling problem, and propose (a) a linear-time algorithm for locally optimal B/S insertion subject to a given schedule; (b) an SMT formulation to find the global optimum; and (c) an efficient heuristic for global B/S optimization. Experimental results show a reduction of 4% on the B/S cost and 124X speed-up compared to the state-of-the-art algorithm, and capability to scale to a magnitude larger benchmarks.
{"title":"Beyond local optimality of buffer and splitter insertion for AQFP circuits","authors":"Siang-Yun Lee, Heinz Riener, G. De Micheli","doi":"10.1145/3489517.3530661","DOIUrl":"https://doi.org/10.1145/3489517.3530661","url":null,"abstract":"Adiabatic quantum-flux parametron (AQFP) is an energy-efficient superconducting technology. Buffer and splitter (B/S) cells must be inserted to an AQFP circuit to meet the technology-imposed constraints on path balancing and fanout branching. These cells account for a significant amount of the circuit's area and delay. In this paper, we identify that B/S insertion is a scheduling problem, and propose (a) a linear-time algorithm for locally optimal B/S insertion subject to a given schedule; (b) an SMT formulation to find the global optimum; and (c) an efficient heuristic for global B/S optimization. Experimental results show a reduction of 4% on the B/S cost and 124X speed-up compared to the state-of-the-art algorithm, and capability to scale to a magnitude larger benchmarks.","PeriodicalId":373005,"journal":{"name":"Proceedings of the 59th ACM/IEEE Design Automation Conference","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117319026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}