The development of application-specific accelerators for deep convolutional neural networks (ConvNets) has mainly focussed on accelerating the computationally intensive layers, that is, the convolutional layers, to improve performance and energy efficiency. Traditional approaches in this space have relied on handcrafted dataflow implementations to leverage the fine-grained parallelism and data-locality properties within these layers. However, ConvNet layers also have untapped potential from cross-layer data locality. In our work, we explore a novel approach in the context of deep neural network accelerators by modelling the computation as a task-dependency directed acyclic graph and proposing a memory-aware heuristic based on Heterogeneous Earliest Finish Time (HEFT) for task-graph scheduling on shared-memory systems. Our results show the benefits of task graphs in terms of better memory use (23.4% less) over conventional layer-by-layer processing in a simulated environment with the first three layers of LeNet-5. Certain task graphs trade off makespan (10% increase) for memory use (20% decrease). Finally, our exploration of graphs with different slicing configurations for the pooling layer, using memory-aware HEFT versus the original HEFT, reveals that regularly shaped tiles across layers offer better makespan and memory use than tiles with large dimensions along one axis.
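To make the cross-layer locality argument concrete, the following minimal sketch compares the peak live memory of two execution orders for a sliced convolution followed by a sliced pooling layer. It is not the paper's memory-aware HEFT heuristic: the task names, buffer sizes, and the single-processor, free-when-all-consumers-ran memory model are assumptions chosen for illustration.

```python
def peak_memory(order, out_size, consumers):
    """Peak live memory when tasks run one at a time in `order`.

    Assumed model: a task's output buffer is allocated when the task runs and
    freed as soon as every consumer of that buffer has run. Buffers without
    consumers (final results) stay live until the end.
    """
    remaining = {t: len(consumers.get(t, [])) for t in order}
    producers = {t: [] for t in order}
    for task, users in consumers.items():
        for user in users:
            producers[user].append(task)

    live = peak = 0
    for task in order:
        live += out_size[task]            # allocate this task's output
        peak = max(peak, live)
        for p in producers[task]:         # inputs of this task were just consumed
            remaining[p] -= 1
            if remaining[p] == 0:
                live -= out_size[p]       # last consumer ran: free the buffer
    return peak


# Hypothetical graph: four convolution slices, each feeding one pooling slice.
consumers = {f"conv{i}": [f"pool{i}"] for i in range(4)}
out_size = {f"conv{i}": 4 for i in range(4)}
out_size.update({f"pool{i}": 1 for i in range(4)})

layer_by_layer = [f"conv{i}" for i in range(4)] + [f"pool{i}" for i in range(4)]
cross_layer = [name for i in range(4) for name in (f"conv{i}", f"pool{i}")]

print(peak_memory(layer_by_layer, out_size, consumers))  # 17
print(peak_memory(cross_layer, out_size, consumers))     # 8
```

In this toy graph the interleaved order frees each convolution slice before the next one is produced, which is the kind of saving a task-graph scheduler can exploit and a layer-by-layer schedule cannot.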
{"title":"Exploration of task-based scheduling for convolutional neural networks accelerators under memory constraints","authors":"Crefeda Faviola Rodrigues, G. Riley, M. Luján","doi":"10.1145/3310273.3323162","DOIUrl":"https://doi.org/10.1145/3310273.3323162","url":null,"abstract":"Development of application specific accelerators for deep convolutional neural networks (ConvNets) have mainly focussed on accelerating the computationally intensive layers, that is the convolutional layers, to improve performance and energy efficiency. Traditional approaches in this space have relied on handcrafted dataflow implementations to leverage the fine-grained parallelism and data-locality properties within these layers. However, ConvNets layers also have an untapped potential from cross-layer data locality. In our work, we explore a novel approach in the context of deep neural networks accelerators by modelling the computation as a task-dependency directed acyclic graph and proposing a memory-aware heuristic based onHeterogeneous Earliest Finish Time (HEFT) for task-graph scheduling on shared memory systems. Our results show the benefits of task graphs in terms of better memory use (23.4 % less) over conventional layer-by-layer processing in a simulated environment with the first three layers of LeNet-5. Certain task-graphs trade-off makespan (10% increase) for memory use (20 % decrease). Finally, our exploration of graphs with different slicing configurations for the pooling layer while using memory-aware HEFT versus the original HEFT reveals that regular shaped tiles across layers offers better makespan and memory use than tiles with large dimensions along one axis.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129770084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern Implantable Medical Devices (IMDs) feature wireless connectivity, which makes them vulnerable to security attacks. Particular to IMDs is the battery Denial-of-Service attack, whereby attackers aim to fully deplete the battery by occupying the IMD with continuous authentication requests. Zero-Power Defense (ZPD) based on energy harvesting is known to be an excellent protection against these attacks. This paper establishes essential design specifications for employing ZPD techniques in IMDs, offers a critical review of ZPD techniques found in the literature and, subsequently, gives crucial recommendations for developing comprehensive ZPD solutions.
{"title":"Towards realistic battery-DoS protection of implantable medical devices","authors":"M. Siddiqi, C. Strydis","doi":"10.1145/3310273.3321555","DOIUrl":"https://doi.org/10.1145/3310273.3321555","url":null,"abstract":"Modern Implantable Medical Devices (IMDs) feature wireless connectivity, which makes them vulnerable to security attacks. Particular to IMDs is the battery Denial-of-Service attack whereby attackers aim to fully deplete the battery by occupying the IMD with continuous authentication requests. Zero-Power Defense (ZPD) based on energy harvesting is known to be an excellent protection against these attacks. This paper establishes essential design specifications for employing ZPD techniques in IMDs, offers a critical review of ZPD techniques found in literature and, subsequently, gives crucial recommendations for developing comprehensive ZPD solutions.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121068043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Implantable Medical Devices (IMDs) such as pacemakers and neurostimulators are highly constrained in terms of energy. In addition, the wireless-communication facilities of these devices impose security requirements, considering their life-critical nature. However, security solutions that provide considerable coverage are generally considered to be too taxing on an IMD battery. Consequently, there has been a tendency in the literature to adopt ultra-lightweight security primitives for IMDs. In this work, we demonstrate that recent advances in embedded computing in fact enable IMDs to use more mainstream security primitives, which do not need to compromise significantly on security for fear of impacting IMD autonomy.
{"title":"IMD security vs. energy: are we tilting at windmills?: POSTER","authors":"M. Siddiqi, C. Strydis","doi":"10.1145/3310273.3323421","DOIUrl":"https://doi.org/10.1145/3310273.3323421","url":null,"abstract":"Implantable Medical Devices (IMDs) such as pacemakers and neurostimulators are highly constrained in terms of energy. In addition, the wireless-communication facilities of these devices also impose security requirements considering their life-critical nature. However, security solutions that provide considerable coverage are generally considered to be too taxing on an IMD battery. Consequently, there has been a tendency to adopt ultra-lightweight security primitives for IMDs in literature. In this work, we demonstrate that the recent advances in embedded computing in fact enable the IMDs to use more mainstream security primitives, which do not need to compromise significantly on security for fear of impacting IMD autonomy.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125785750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the minimum vertex cover problem, which has applications in, e.g., biochemistry and network security. Quantum annealers can find the optimum solution of such NP-hard problems, provided that the problems can be embedded on the hardware. This is often infeasible due to limitations of the hardware connectivity structure. This paper presents a decomposition algorithm for the minimum vertex cover problem: the algorithm recursively divides an arbitrary problem until the generated subproblems can be embedded and solved on the annealer. To speed up the decomposition, we propose several pruning and reduction techniques. The performance of our algorithm is assessed in a simulation study.
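The recursive division described above can be sketched as follows. The abstract does not spell out the splitting rule, so this sketch assumes a standard branching step (a chosen vertex is either in the cover, or all of its neighbours are), and a brute-force solver stands in for the annealer. The function names and the `limit` parameter are hypothetical, and no pruning or reduction is included.

```python
from itertools import combinations


def solve_small(adj):
    """Brute-force stand-in for the annealer: exact cover of a small graph."""
    verts = list(adj)
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    for k in range(len(verts) + 1):
        for cand in combinations(verts, k):
            chosen = set(cand)
            if all(edge & chosen for edge in edges):
                return chosen


def remove(adj, drop):
    """Return the graph induced on the vertices not in `drop`."""
    return {u: nbrs - drop for u, nbrs in adj.items() if u not in drop}


def min_vertex_cover(adj, limit):
    """Split the graph until each subproblem has at most `limit` vertices."""
    if len(adj) <= limit:
        return solve_small(adj)
    v = max(adj, key=lambda u: len(adj[u]))      # branch on a highest-degree vertex
    if not adj[v]:
        return set()                             # no edges left, nothing to cover
    # Case 1: v is in the cover, so v and its edges disappear.
    with_v = {v} | min_vertex_cover(remove(adj, {v}), limit)
    # Case 2: v is not in the cover, so every neighbour of v must be.
    nbrs = set(adj[v])
    without_v = nbrs | min_vertex_cover(remove(adj, nbrs | {v}), limit)
    return min(with_v, without_v, key=len)


# A 5-cycle, split until subproblems have at most 3 vertices.
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(len(min_vertex_cover(cycle, limit=3)))  # 3
```

Each branch removes at least one vertex, so the recursion always reaches subproblems of the requested size; the number of subproblems can grow exponentially, which is why pruning and reduction matter in practice.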
{"title":"Solving large minimum vertex cover problems on a quantum annealer","authors":"Elijah Pelofske, Georg Hahn, H. Djidjev","doi":"10.1145/3310273.3321562","DOIUrl":"https://doi.org/10.1145/3310273.3321562","url":null,"abstract":"We consider the minimum vertex cover problem having applications in e.g. biochemistry and network security. Quantum annealers can find the optimum solution of such NP-hard problems, given they can be embedded on the hardware. This is often infeasible due to limitations of the hardware connectivity structure. This paper presents a decomposition algorithm for the minimum vertex cover problem: The algorithm recursively divides an arbitrary problem until the generated subproblems can be embedded and solved on the annealer. To speed up the decomposition, we propose several pruning and reduction techniques. The performance of our algorithm is assessed in a simulation study.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124121617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MP net is a formal model specifically designed for parallel applications that use the Message Passing Interface (MPI). The main idea is to use MP net as a comprehensible way of presenting the actual structure of communication within MPI applications. The goal is to provide users with the kind of feedback that helps them check quickly whether the actual communication within their application corresponds to the intended one. This paper introduces MP net, which focuses on the communication part of parallel applications and emphasizes its spatial character, which is rather hidden in the sequential (textual) form.
{"title":"MP net as abstract model of communication for message-passing applications","authors":"Martin Surkovský","doi":"10.1145/3310273.3322824","DOIUrl":"https://doi.org/10.1145/3310273.3322824","url":null,"abstract":"MP net is a formal model specifically designed for the field of parallel applications that use message passing interface. The main idea is to use MP net as a comprehensible way of presenting the actual structure of communication within MPI applications. The goal is to provide users with the kind of feedback that can help them to check quickly whether or not the actual communication within their application corresponds to the intended one. This paper introduces MP net that focuses on the communication part of parallel applications and emphasizes its spatial character, which is rather hidden in sequential (textual) form.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130660413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Swamit S. Tannu, Poulami Das, Michael L. Lewis, Robert F. Krick, Douglas M. Carmean, Moinuddin K. Qureshi
As scaling of CMOS slows down, there is growing interest in alternative technologies that can improve performance and energy efficiency. Superconducting circuits based on Josephson Junctions (JJ) are an emerging technology that provides devices that can be switched with picosecond latencies and consume two orders of magnitude less switching energy than CMOS. While JJ-based circuits can operate at high frequencies and are energy-efficient, the technology faces three critical challenges: limited device density and the lack of an area-efficient technology for memory structures, low gate fanout, and new failure modes of flux traps that occur due to the operating environment. Limited memory density restricts the use of superconducting technology in the near term to application domains that have high compute intensity but require a negligible amount of memory. In this paper, we study the use of superconducting technology to build an accelerator for the SHA-256 engines commonly used in Bitcoin mining. We show that merely porting an existing CMOS-based accelerator to superconducting technology provides a 10.6X improvement in energy efficiency. Redesigning the accelerator to suit the unique constraints of superconducting technology (such as low fanout) improves the energy efficiency to 12.2X. We also investigate solutions to make the accelerator tolerant of new fault modes and show how this fault-tolerant design can be leveraged to reduce the operating current, thereby improving the overall energy efficiency to 46X.
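As a software illustration of why SHA-256 mining is compute-intensive yet needs almost no memory, here is a minimal sketch of the double SHA-256 nonce search. It uses Python's standard `hashlib` and is not the accelerator design from the paper; the header bytes and the target value are made-up placeholders, and real block headers have a fixed 80-byte layout that is not modelled here.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    """Apply SHA-256 twice, as Bitcoin does for block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header_prefix: bytes, target: int, max_nonce: int = 2**32):
    """Try 32-bit nonces until the hash, read as a little-endian integer,
    falls below `target`. Returns (nonce, digest) or None."""
    for nonce in range(max_nonce):
        digest = double_sha256(header_prefix + struct.pack("<I", nonce))
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None


# Placeholder header and an easy target (about 1 in 256 hashes succeeds).
result = mine(b"example-header", target=1 << 248)
if result is not None:
    nonce, digest = result
    print(nonce, digest[::-1].hex())
```

The working state of the loop is a few dozen bytes, while the work per candidate is two full hash computations, which matches the compute-heavy, memory-light profile the abstract describes.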
{"title":"A case for superconducting accelerators","authors":"Swamit S. Tannu, Poulami Das, Michael L. Lewis, Robert F. Krick, Douglas M. Carmean, Moinuddin K. Qureshi","doi":"10.1145/3310273.3321561","DOIUrl":"https://doi.org/10.1145/3310273.3321561","url":null,"abstract":"As scaling of CMOS slows down, there is growing interest in alternative technologies that can improve performance and energy-efficiency. Superconducting circuits based on Josephson Junctions (JJ) is an emerging technology that provides devices which can be switched with pico-second latencies and consumes two orders of magnitude lower switching energy compared to CMOS. While JJ-based circuits can operate at high frequencies and are energy-efficient, the technology faces three critical challenges: limited device density and lack of area-efficient technology for memory structures, low gate fanout, and new failure modes of Flux-Traps that occurs due to the operating environment. Limited memory density restricts the use of superconducting technology in the near term to application domains that have high compute intensity but require negligible amount of memory. In this paper, we study the use of superconducting technology to build an accelerator for SHA-256 engines commonly used in Bitcoin mining. We show that merely porting existing CMOS-based accelerator to superconducting technology provides 10.6X improvement in energy efficiency. Redesigning the accelerator to suit the unique constraints of superconducting technology (such as low fanout) improves the energy efficiency to 12.2X. We also investigate solutions to make the accelerator tolerant of new fault modes and show how this fault-tolerant design can be leveraged to reduce the operating current, thereby improving the overall energy-efficiency to 46X.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125236996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Homomorphic encryption (HE), the ability to perform computation on encrypted data, is an attractive remedy to increasing concerns about data privacy in deep learning (DL). However, building DL models that operate on ciphertext is currently labor-intensive and requires simultaneous expertise in DL, cryptography, and software engineering. DL frameworks and recent advances in graph compilers have greatly accelerated the training and deployment of DL models to various computing platforms. We introduce nGraph-HE, an extension of nGraph, Intel's DL graph compiler, which enables deployment of trained models with popular frameworks such as TensorFlow while simply treating HE as another hardware target. Our graph-compiler approach enables HE-aware optimizations, implemented at compile time, such as constant folding and HE-SIMD packing, and at run time, such as special value plaintext bypass. Furthermore, nGraph-HE integrates with DL frameworks such as TensorFlow, enabling data scientists to benchmark DL models with minimal overhead.
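To illustrate what computing on encrypted data means, the sketch below evaluates a dot product between an encrypted vector and plaintext weights. It swaps in a toy Paillier scheme (additively homomorphic only, with tiny insecure primes) rather than the lattice-based schemes a production HE backend would use, and it is not nGraph-HE's API. The zero and one shortcuts in `encrypted_dot` are only a loose analogue of the special value plaintext bypass mentioned above; all function names are hypothetical.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier key pair with g = n + 1. The primes are far too small for real use."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                                # modular inverse (Python 3.8+)
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt an integer 0 <= m < n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    lam, mu = priv
    return ((pow(c, lam, n * n) - 1) // n) * mu % n


def encrypted_dot(n, ciphertexts, weights):
    """Dot product of encrypted inputs with non-negative plaintext weights."""
    n2 = n * n
    acc = 1                                   # 1 is a valid encryption of 0
    for c, w in zip(ciphertexts, weights):
        if w == 0:
            continue                          # bypass: the term contributes nothing
        term = c if w == 1 else pow(c, w, n2) # bypass: skip exponentiation for w == 1
        acc = (acc * term) % n2               # multiplying ciphertexts adds plaintexts
    return acc


n, priv = keygen()
xs = [3, 5, 7, 2]
ws = [0, 1, 4, 1]
enc = [encrypt(n, x) for x in xs]
print(decrypt(n, priv, encrypted_dot(n, enc, ws)))  # 35
```

The party computing `encrypted_dot` never sees `xs`; only the holder of the private key can read the result.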
{"title":"nGraph-HE: a graph compiler for deep learning on homomorphically encrypted data","authors":"Fabian Boemer, Yixing Lao, Casimir Wierzynski","doi":"10.1145/3310273.3323047","DOIUrl":"https://doi.org/10.1145/3310273.3323047","url":null,"abstract":"Homomorphic encryption (HE)---the ability to perform computation on encrypted data---is an attractive remedy to increasing concerns about data privacy in deep learning (DL). However, building DL models that operate on ciphertext is currently labor-intensive and requires simultaneous expertise in DL, cryptography, and software engineering. DL frameworks and recent advances in graph compilers have greatly accelerated the training and deployment of DL models to various computing platforms. We introduce nGraph-HE, an extension of nGraph, Intel's DL graph compiler, which enables deployment of trained models with popular frameworks such as TensorFlow while simply treating HE as another hardware target. Our graph-compiler approach enables HE-aware optimizations- implemented at compile-time, such as constant folding and HE-SIMD packing, and at run-time, such as special value plaintext bypass. Furthermore, nGraph-HE integrates with DL frameworks such as TensorFlow, enabling data scientists to benchmark DL models with minimal overhead.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123795871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matthew Sotoudeh, Anand Venkat, Michael J. Anderson, E. Georganas, A. Heinecke, Jason Knight
Domain-specific accelerators present new challenges for code generation onto novel instruction sets, communication fabrics, and memory architectures. We introduce a shared intermediate representation to describe both deep learning programs and hardware capabilities, then formulate and apply instruction mapping to determine how a computation can be performed on a hardware system. Our scheduler chooses a specific mapping and determines data movement and computation order. With this system, we demonstrate automated extraction of matrix multiplication kernels from recent deep learning operations. We demonstrate 2-5X better performance on GEMM and GRU execution versus the state of the art on new hardware, and up to 85% of state-of-the-art performance on existing hardware.
{"title":"ISA mapper: a compute and hardware agnostic deep learning compiler","authors":"Matthew Sotoudeh, Anand Venkat, Michael J. Anderson, E. Georganas, A. Heinecke, Jason Knight","doi":"10.1145/3310273.3321559","DOIUrl":"https://doi.org/10.1145/3310273.3321559","url":null,"abstract":"Domain specific accelerators present new challenges for code generation onto novel instruction sets, communication fabrics, and memory architectures. We introduce a shared intermediate representation to describe both deep learning programs and hardware capabilities, then formulate and apply instruction mapping to determine how a computation can be performed on a hardware system. Our scheduler chooses a specific mapping and determines data movement and computation order. With this system, we demonstrate automated extraction of matrix multiplication kernels from recent deep learning operations. We demonstrate 2--5X better performance on GEMM and GRU execution versus state-of-the-art on new hardware and up to 85% of state-of-the-art performance on existing hardware.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127031896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we present the Quantum Encoded Quantum Evolutionary Algorithm (QEQEA) and compare its performance against a classical GPU-accelerated Genetic Algorithm (GPUGA). The proposed QEQEA differs from existing quantum evolutionary algorithms in several respects: candidate circuits are represented using qubits and qutrits, and the proposed evolutionary operators can theoretically be implemented on a quantum computer provided that classical control exists. The synthesized circuits are obtained by a set of measurements performed on the encoding units of the quantum representation. Both algorithms are accelerated using a general-purpose graphics processing unit (GPGPU). The main aim of this paper is not to propose a completely novel quantum genetic algorithm but rather to experimentally estimate the advantages of encoding and implementing certain components of a genetic algorithm in a quantum-compatible manner. The algorithms are compared and evaluated on several reversible and quantum circuits. The results demonstrate that, on the one hand, the quantum encoding and quantum-compatible implementation have certain disadvantages with respect to classical evolutionary computation. On the other hand, encoding certain components in a quantum-compatible manner could in theory accelerate the search by incurring only a small overhead when built into a quantum computer. This acceleration would in turn counterbalance the implementation limitations.
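The idea of obtaining a candidate circuit by measuring its encoding units can be sketched classically. The encoding below is an assumption made for illustration, not the paper's: each circuit slot holds one qutrit (three amplitudes selecting a gate from a hypothetical three-gate alphabet) and one qubit (two amplitudes selecting a target wire), and a measurement picks each outcome with probability equal to the squared magnitude of its amplitude.

```python
import math
import random

GATES = ("I", "NOT", "CNOT")   # hypothetical gate alphabet, one gate per qutrit outcome


def measure(unit, rng):
    """Collapse one encoding unit: outcome i is drawn with probability |amplitude_i|^2."""
    probs = [abs(a) ** 2 for a in unit]
    return rng.choices(range(len(unit)), weights=probs, k=1)[0]


def measure_circuit(chromosome, rng):
    """Measure every slot (qutrit, qubit) to obtain a concrete gate sequence."""
    return [(GATES[measure(qutrit, rng)], measure(qubit, rng))
            for qutrit, qubit in chromosome]


rng = random.Random(0)
uniform_qutrit = (1 / math.sqrt(3),) * 3
uniform_qubit = (1 / math.sqrt(2),) * 2

# Three slots in uniform superposition: each measurement yields a random circuit.
chromosome = [(uniform_qutrit, uniform_qubit)] * 3
print(measure_circuit(chromosome, rng))

# A slot whose amplitudes are concentrated on one outcome always measures the same.
fixed = [((0.0, 1.0, 0.0), (1.0, 0.0))]
print(measure_circuit(fixed, rng))  # [('NOT', 0)]
```

An evolutionary operator in this setting would adjust the amplitudes between measurements, for example by shifting probability towards the outcomes of the fittest measured circuit; that step is omitted here.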
{"title":"Quantum encoded quantum evolutionary algorithm for the design of quantum circuits","authors":"Georgiy Krylov, M. Lukac","doi":"10.1145/3310273.3322826","DOIUrl":"https://doi.org/10.1145/3310273.3322826","url":null,"abstract":"In this paper we present Quanrum Encoded Quantum Evolutionary Algorithm (QEQEA) and compare its performance against a a classical GPU accelerated Genetic Algorithm (GPUGA). The proposed QEQEA differs from existing quantum evolutionary algorithms in several points: representation of candidates circuits is using qubits and qutrits and the proposed evolutionary operators can theoretically be implemented on quantum computer provided a classical control exists. The synthesized circuits are obtained by a set of measurements performed on the encoding units of quantum representation. Both algorithms are accelerated using (general purpose graphic processing unit) GPGPU. The main target of this paper is not to propose a completely novel quantum genetic algorithm but to rather experimentally estimate the advantages of certain components of genetic algorithm being encoded and implemented in a quantum compatible manner. The algorithms are compared and evaluated on several reversible and quantum circuits. The results demonstrate that on one hand the quantum encoding and quantum implementation compatible implementation provides certain disadvantages with respect to the classical evolutionary computation. On the other hand, encoding certain components in a quantum compatible manner could in theory allow to accelerate the search by providing small overhead when built in quantum computer. Therefore acceleration would in turn counter weight the implementation limitations.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129784080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We propose personal volunteer computing, a novel paradigm to encourage technical solutions that leverage personal devices, such as smartphones and laptops, for personal applications that require significant computations, such as animation rendering and image processing. The paradigm requires no investment in additional hardware, relying instead on devices that are already owned by users and their community, and favours simple tools that can be implemented part-time by a single developer. We show that samples of personal devices of today are competitive with a top-of-the-line laptop from two years ago. We also propose new directions to extend the paradigm.
{"title":"Personal volunteer computing","authors":"Erick Lavoie, L. Hendren","doi":"10.1145/3310273.3322819","DOIUrl":"https://doi.org/10.1145/3310273.3322819","url":null,"abstract":"We propose personal volunteer computing, a novel paradigm to encourage technical solutions that leverage personal devices, such as smartphones and laptops, for personal applications that require significant computations, such as animation rendering and image processing. The paradigm requires no investment in additional hardware, relying instead on devices that are already owned by users and their community, and favours simple tools that can be implemented part-time by a single developer. We show that samples of personal devices of today are competitive with a top-of-the-line laptop from two years ago. We also propose new directions to extend the paradigm.","PeriodicalId":431860,"journal":{"name":"Proceedings of the 16th ACM International Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121599705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}