Building Robust Machine Learning Systems: Current Progress, Research Challenges, and Opportunities
J. Zhang, Kang Liu, Faiq Khalid, Muhammad Abdullah Hanif, Semeen Rehman, T. Theocharides, Alessandro Artussi, M. Shafique, S. Garg
DOI: 10.1145/3316781.3323472
Machine learning, in particular deep learning, is being used in almost every aspect of life to assist humans, especially in mobile and Internet of Things (IoT)-based applications. Owing to its state-of-the-art performance, deep learning is also being employed in safety-critical applications such as autonomous vehicles. Reliability and security are two key requirements for these applications because of their potential impact on human lives. To this end, this paper highlights the current progress, challenges, and research opportunities in the domain of robust systems for machine learning-based applications.
{"title":"Building Robust Machine Learning Systems: Current Progress, Research Challenges, and Opportunities","authors":"J. Zhang, Kang Liu, Faiq Khalid, Muhammad Abdullah Hanif, Semeen Rehman, T. Theocharides, Alessandro Artussi, M. Shafique, S. Garg","doi":"10.1145/3316781.3323472","DOIUrl":"https://doi.org/10.1145/3316781.3323472","url":null,"abstract":"Machine learning, in particular deep learning, is being used in almost all the aspects of life to facilitate humans, specifically in mobile and Internet of Things (IoT)-based applications. Due to its state-of-the-art performance, deep learning is also being employed in safety-critical applications, for instance, autonomous vehicles. Reliability and security are two of the key required characteristics for these applications because of the impact they can have on human's life. Towards this, in this paper, we highlight the current progress, challenges and research opportunities in the domain of robust systems for machine learning-based applications.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129215326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AlignS
Shaahin Angizi, Jiao-Jin Sun, Wei Zhang, Deliang Fan
DOI: 10.1145/3316781.3317764
Classified as a complex big-data analytics problem, DNA short read alignment is a major sequential bottleneck for the massive amounts of data generated by next-generation sequencing platforms. With von Neumann computing architectures struggling to address such a computationally expensive and memory-intensive task, Processing-in-Memory (PIM) platforms are gaining growing interest. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop AlignS's platform, which harnesses SOT-MRAM as computational memory and transforms it into a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that requires only simple and parallel in-memory operations (i.e., comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequences, fully exploiting the algorithm-level parallelism and accelerating both exact and inexact matches. Device-to-architecture co-simulation results show that AlignS improves short read alignment throughput per Watt per mm² by ~12× compared to an ASIC accelerator. Compared to a recent FM-index-based ReRAM platform, AlignS achieves 1.6× higher throughput per Watt.
{"title":"AlignS","authors":"Shaahin Angizi, Jiao-Jin Sun, Wei Zhang, Deliang Fan","doi":"10.1145/3316781.3317764","DOIUrl":"https://doi.org/10.1145/3316781.3317764","url":null,"abstract":"Classified as a complex big data analytics problem, DNA short read alignment serves as a major sequential bottleneck to massive amounts of data generated by next-generation sequencing platforms. With Von-Neumann computing architectures struggling to address such computationally-expensive and memory-intensive task today, Processing-in-Memory (PIM) platforms are gaining growing interests. In this paper, an energy-efficient and parallel PIM accelerator (AlignS) is proposed to execute DNA short read alignment based on an optimized and hardware-friendly alignment algorithm. We first develop AlignS’s platform that harnesses SOT-MRAM as computational memory and transforms it to a fundamental processing unit for short read alignment. Accordingly, we present a novel, customized, highly parallel read alignment algorithm that only seeks the proposed simple and parallel in-memory operations (i.e. comparisons and additions). AlignS is then optimized through a new correlated data partitioning and mapping methodology that allows local storage and processing of DNA sequence to fully exploit the algorithm-level’s parallelism, and to accelerate both exact and inexact matches. The device-to-architecture co-simulation results show that AlignS improves the short read alignment throughput per Watt per mm2 by ~12× compared to the ASIC accelerator. Compared to recent FM-index-based ReRAM platform, AlignS achieves 1.6× higher throughput per Watt.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116636669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShuntFlow
Shijun Gong, Jiajun Li, Wenyan Lu, Guihai Yan, Xiaowei Li
DOI: 10.1145/3316781.3317910
Stream processing is an important and growing class of applications for analyzing continuous streams of real-time data. Sliding-window aggregations (SWAGs) dominate the computation time in such applications and demand an unprecedented computation capacity, which poses a great challenge to computing architectures. General-purpose processors cannot handle SWAGs efficiently because of their specific computation patterns. This paper proposes an efficient accelerator architecture for ubiquitous SWAGs, called ShuntFlow. ShuntFlow is a type of Kernel Processing Unit (KPU), where "kernel" refers to the two main categories of SWAG operations widely used in stream processing. We further propose a shunt rule that enables ShuntFlow to efficiently handle SWAGs with arbitrary parameters. As a case study, we implemented ShuntFlow on an Altera Arria 10 AX115N FPGA board at 150 MHz and compared it to previous approaches. The experimental results show that ShuntFlow provides a substantial throughput and latency advantage over CPU and GPU implementations on both reduce-like and index-like SWAGs.
{"title":"ShuntFlow","authors":"Shijun Gong, Jiajun Li, Wenyan Lu, Guihai Yan, Xiaowei Li","doi":"10.1145/3316781.3317910","DOIUrl":"https://doi.org/10.1145/3316781.3317910","url":null,"abstract":"Streaming processing is an important and growing class of applications for analyzing continuous streams of real time data. Slidingwindow aggregations (SWAGs) dominate the computation time in such applications and dictate an unprecedented computation capacity which poses a great challenge to the computing architectures. General-purpose processors cannot efficiently handle SWAGs because of the specific computation patterns. This paper proposes an efficient accelerator architecture for ubiquitous SWAGs, called ShuntFlow. ShuntFlow is a typical type of Kernel Processing Unit (KPU) where “Kernel” represent two main categories of SWAG operations widely used in streaming processing. Meanwhile, we propose a shunt rule to enable ShuntFlow to efficiently handle SWAGs with arbitrary parameters. As a case study, we implemented ShuntFlow on an Altera Arria 10 AX115N FPGA board at 150 MHz and compared it to previous approaches. The experimental results show that ShuntFlow provides a tremendous throughput and latency advantage over CPU and GPU implementations on both reduce-like and index-like SWAGs.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121493872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NCTUcell
Yih-Lang Li, Shih-Ting Lin, S. Nishizawa, Hong-Yan Su, Ming-Jie Fong, Oscar Chen, H. Onodera
DOI: 10.1145/3316781.3317868
At the 7nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, increasing placement area. This is the first work to fully automatically synthesize a DDA-aware cell library, with an optimized number of drains on cell boundaries, based on the ASAP 7nm PDK. We propose a DDA-aware, dynamic-programming-based transistor placement. Previous works ignore the use of the M0 layer in cell routing; we first propose an ILP-based M0 routing planning. With M0 routing, the congestion of M1 routing can be reduced and pin accessibility improved, owing to the diminished use of M2 routing. To improve routing resource utilization, we propose an implicitly adjustable grid map, enabling the maze router to explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library requires 70.9% fewer filler cells than placement using a traditional cell library, achieving a block area reduction of 5.7%.
{"title":"NCTUcell","authors":"Yih-Lang Li, Shih-Ting Lin, S. Nishizawa, Hong-Yan Su, Ming-Jie Fong, Oscar Chen, H. Onodera","doi":"10.1145/3316781.3317868","DOIUrl":"https://doi.org/10.1145/3316781.3317868","url":null,"abstract":"For 7nm technology node, cell placement with drain-to-drain abutment (DDA) requires additional filler cells, increasing placement area. This is the first work to fully automatically synthesize a DDA-aware cell library with optimized number of drains on cell boundary based on ASAP 7nm PDK. We propose a DDA-aware dynamic programming based transistor placement. Previous works ignore the use of M0 layer in cell routing. We firstly propose an ILP-based M0 routing planning. With M0 routing, the congestion of M1 routing can be reduced and the pin accessibility can be improved due to the diminished use of M2 routing. To improve the routing resource utilization, we propose an implicitly adjustable grid map, making the maze routing able to explore more routing solutions. Experimental results show that block placement using the DDA-aware cell library requires less filler cells than that using traditional cell library by 70.9%, which achieves a block area reduction rate of 5.7%.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122710471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DREDGE
Andrew McCrabb, Eric Winsor, V. Bertacco
DOI: 10.1145/3316781.3317804
Graph-based algorithms have gained significant interest in several application domains. Solutions addressing the computational efficiency of such algorithms have mostly relied on many-core architectures. Cleverly laying out input graphs in storage, by placing adjacent vertices in the same storage unit (memory bank or cache unit), enables fast access during graph traversal. Dynamic graphs, however, must be continuously repartitioned to leverage this benefit. Yet software repartitioning solutions rely on costly cross-vault communication to query and optimize the graph layout between algorithm iterations. In this work, we propose DREDGE, a novel hardware solution that provides heuristic repartitioning optimizations in the background without extra communication. Our evaluation indicates a 1.9× speedup, on average, across several graph algorithms and datasets executing on a 24×24-core architecture, compared against a baseline that does not repartition the dynamic graph. We estimate that DREDGE incurs only 1.5% area and 2.1% power overheads over an ARM A5 processor core.
CCS Concepts: • Hardware → Hardware accelerators; Application-specific processors; • Mathematics of computing → Graph theory.
{"title":"DREDGE","authors":"Andrew McCrabb, Eric Winsor, V. Bertacco","doi":"10.1145/3316781.3317804","DOIUrl":"https://doi.org/10.1145/3316781.3317804","url":null,"abstract":"Graph-based algorithms have gained significant interest in several application domains. Solutions addressing the computational efficiency of such algorithms have mostly relied on many-core architectures. Cleverly laying out input graphs in storage, by placing adjacent vertices in a same storage unit (memory bank or cache unit), enables fast access during graph traversal. Dynamic graphs, however, must be continuously repartitioned to leverage this benefit. Yet software repartitioning solutions rely on costly, cross-vault communication to query and optimize the graph layout between algorithm iterations. In this work, we propose DREDGE, a novel hardware solution to provide heuristic repartitioning optimizations in the background without extra communication. Our evaluation indicates that we achieve a $1.9 x$ speedup, on average, over several graph algorithms and datasets, executing on a 24x24-core architecture, when compared against a baseline solution that does not repartition the dynamic graph. We estimated that DREDGE incurs only 1.5% area and 2.1% power overheads over an ARM A5 processor core. CCS CONCEPTS • Hardware $rightarrow$ Hardware accelerators; Application specific processors; • Mathematics of computing $rightarrow$ Graph theory;","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124958807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MiniControl
Xing Huang, Tsung-Yi Ho, Wenzhong Guo, Bing Li, Ulf Schlichtmann
DOI: 10.1145/3316781.3317864
Recent advances in continuous-flow microfluidics have enabled highly integrated lab-on-a-chip biochips. These chips can execute complex biochemical applications precisely and efficiently within a tiny area, but they require a large number of control ports and the corresponding control logic to generate the pressure patterns required for flow control, which offsets their advantages and prevents their wide adoption. In this paper, we propose MiniControl, the first synthesis flow for continuous-flow microfluidic biochips (CFMBs) under strict constraints on control ports, incorporating high-level synthesis and physical design simultaneously, which has never been considered in previous work. With the maximum number of allowed control ports specified in advance, this synthesis flow generates a biochip architecture with high execution efficiency. Moreover, the overall cost of a CFMB can be reduced, and the tradeoff between control logic and the execution efficiency of biochemical applications can be evaluated for the first time. Experimental results demonstrate that MiniControl achieves high execution efficiency and low overall platform cost while strictly satisfying the given control port constraint.
{"title":"MiniControl","authors":"Xing Huang, Tsung-Yi Ho, Wenzhong Guo, Bing Li, Ulf Schlichtmann","doi":"10.1145/3316781.3317864","DOIUrl":"https://doi.org/10.1145/3316781.3317864","url":null,"abstract":"Recent advances in continuous-flow microfluidics have enabled highly integrated lab-on-a-chip biochips. These chips can execute complex biochemical applications precisely and efficiently within a tiny area, but they require a large number of control ports and the corresponding control logic to generate required pressure patterns for flow control, which, consequently, offset their advantages and prevent their wide adoption. In this paper, we propose the first synthesis flow called MiniControl, for continuous-flow microfluidic biochips (CFMBs) under strict constraints for control ports, incorporating high-level synthesis and physical design simultaneously, which has never been considered in previous work. With the maximum number of allowed control ports specified in advance, this synthesis flow generates a biochip architecture with high execution efficiency. Moreover, the overall cost of a CFMB can be reduced and the tradeoff between control logic and execution efficiency of biochemical applications can be evaluated for the first time. Experimental results demonstrate that MiniControlleads to high execution efficiency and low overall platform cost, while satisfying the given control port constraint strictly.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116360028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NAPEL
Gagandeep Singh, Juan Gómez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, S. Stuijk, Onur Mutlu, H. Corporaal
DOI: 10.1145/3316781.3317867
The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which places compute units close to the memory/storage units. Substantial research effort has gone into proposing NMC architectures and identifying workloads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times on design space exploration. To enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models. We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model based on microarchitectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of 8.5% and 11.6% for performance and energy estimation, respectively, relative to the NMC simulator. NAPEL is also capable of making accurate predictions for previously unseen applications.
{"title":"NAPEL","authors":"Gagandeep Singh, Juan Gómez-Luna, Giovanni Mariani, Geraldo F. Oliveira, Stefano Corda, S. Stuijk, Onur Mutlu, H. Corporaal","doi":"10.1145/3316781.3317867","DOIUrl":"https://doi.org/10.1145/3316781.3317867","url":null,"abstract":"The cost of moving data between the memory/storage units and the compute units is a major contributor to the execution time and energy consumption of modern workloads in computing systems. A promising paradigm to alleviate this data movement bottleneck is near-memory computing (NMC), which consists of placing compute units close to the memory/storage units. There is substantial research effort that proposes NMC architectures and identifies work-loads that can benefit from NMC. System architects typically use simulation techniques to evaluate the performance and energy consumption of their designs. However, simulation is extremely slow, imposing long times for design space exploration. In order to enable fast early-stage design space exploration of NMC architectures, we need high-level performance and energy models.We present NAPEL, a high-level performance and energy estimation framework for NMC architectures. NAPEL leverages ensemble learning to develop a model that is based on micro architectural parameters and application characteristics. NAPEL training uses a statistical technique, called design of experiments, to collect representative training data efficiently. NAPEL provides early design space exploration 220× faster than a state-of-the-art NMC simulator, on average, with error rates of to 8.5% and 11.6% for performance and energy estimations, respectively, compared to the NMC simulator. NAPEL is also capable of making accurate predictions for previously-unseen applications.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133114208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
XBioSiP
B. Prabakaran, Semeen Rehman, M. Shafique
DOI: 10.1145/3316781.3317933
Bio-signals exhibit high redundancy, and the algorithms for processing them are inherently error resilient. This property can be leveraged to improve the energy efficiency of IoT edge devices (wearables) through the emerging trend of approximate computing. This paper presents XBioSiP, a novel methodology for approximate bio-signal processing that employs two quality evaluation stages, during pre-processing and bio-signal processing, to determine the approximation parameters. It thereby achieves high energy savings while satisfying a user-defined quality constraint. Our methodology achieves up to 19× and 22× reductions in the energy consumption of a QRS peak detection algorithm for 0% and <1% loss in peak detection accuracy, respectively.
{"title":"XBioSiP","authors":"B. Prabakaran, Semeen Rehman, M. Shafique","doi":"10.1145/3316781.3317933","DOIUrl":"https://doi.org/10.1145/3316781.3317933","url":null,"abstract":"Bio-signals exhibit high redundancy, and the algorithms for their processing are inherently error resilient. This property can be leveraged to improve the energy-efficiency of IoT-Edge (wearables) through the emerging trend of approximate computing. This paper presents XBioSiP, a novel methodology for approximate bio-signal processing that employs two quality evaluation stages, during the pre-processing and bio-signal processing stages, to determine the approximation parameters. It thereby achieves high energy savings while satisfying the user-determined quality constraint. Our methodology achieves, up to $19 times$ and $22 times$ reduction in the energy consumption of a QRS peak detection algorithm for 0% and <1% loss in peak detection accuracy, respectively.","PeriodicalId":391209,"journal":{"name":"Proceedings of the 56th Annual Design Automation Conference 2019","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127690246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}