Title: Enhanced Fast Boolean Matching based on Sensitivity Signatures Pruning
Authors: Jiaxi Zhang, Liwei Ni, Shenggen Zheng, Hao Liu, Xiangfu Zou, Feng Wang, Guojie Luo
DOI: 10.1109/ICCAD51958.2021.9643587
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: Boolean matching is significant to digital integrated circuit design. An exhaustive method for Boolean matching is computationally expensive even for functions with only a few variables, because the time complexity of such an algorithm for an n-variable Boolean function is O(2^(n+1) · n!). Sensitivity is an important characteristic and a measure of the complexity of Boolean functions, and it has been used in the analysis of algorithm complexity in many fields. This measure can be regarded as a signature of a Boolean function and has great potential to reduce the search space of Boolean matching. In this paper, we introduce Boolean sensitivity into Boolean matching and design several sensitivity-related signatures to enhance fast Boolean matching. First, we propose new signatures that relate sensitivity to Boolean equivalence. Then, we prove that these signatures are prerequisites for Boolean matching, which we can use to prune the search space of the matching problem. In addition, we develop a fast sensitivity calculation method to compute and compare these signatures for two Boolean functions. Compared with the traditional cofactor and symmetry detection methods, sensitivity provides a family of signatures along a different dimension. We also show that sensitivity can be easily integrated into traditional methods to distinguish mismatched Boolean functions faster. To the best of our knowledge, this is the first work that introduces sensitivity to Boolean matching. The experimental results show that the sensitivity-related signatures proposed in this paper greatly reduce the search space and achieve up to a 3x speedup over state-of-the-art Boolean matching methods.
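As an illustration of how sensitivity can prune the matching search space (a generic brute-force sketch, not the paper's signatures or its fast calculation method): sensitivity is invariant under permuting and negating inputs, so two functions with different sensitivity can never be matched.

```python
from itertools import product

def sensitivity(f, n):
    """Brute-force sensitivity: max over all inputs x of the number of
    single-bit flips of x that change f's output."""
    best = 0
    for x in product((0, 1), repeat=n):
        flips = 0
        for i in range(n):
            y = list(x)
            y[i] ^= 1                     # flip the i-th input bit
            if f(x) != f(tuple(y)):
                flips += 1
        best = max(best, flips)
    return best

xor3 = lambda x: x[0] ^ x[1] ^ x[2]       # every bit flip changes the output
maj3 = lambda x: x[0] + x[1] + x[2] >= 2  # 3-input majority
print(sensitivity(xor3, 3))   # 3
print(sensitivity(maj3, 3))   # 2 -> xor3 and maj3 can never match
```

The brute force costs O(2^n · n) evaluations, which is exactly why a fast calculation method matters for larger n.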
Title: Manufacturing Cycle-Time Optimization Using Gaussian Drying Model for Inkjet-Printed Electronics
Authors: Tsun-Ming Tseng, M. Lian, Mengchu Li, P. Rinklin, Leroy Grob, B. Wolfrum, Ulf Schlichtmann
DOI: 10.1109/ICCAD51958.2021.9643438
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: Inkjet-printed electronics have attracted considerable attention for low-cost mass production. To avoid undesired device behavior due to accidental ink merging and redistribution, high-density designs can benefit from layering and drying in batches. The overall manufacturing cycle-time, however, then becomes dominated by the cumulative drying time of the individual layers. The state-of-the-art approach decomposes the whole design, arranges the resulting objects in different layers, and minimizes the number of layers. Fewer layers imply fewer printing iterations and thus higher manufacturing efficiency. Nevertheless, printing objects with significantly different drying dynamics in the same layer reduces manufacturing efficiency, since the slowest-drying object in a given layer dominates the time required for that layer to dry. Consequently, an accurate estimation of the individual layers' drying times is indispensable for minimizing the manufacturing cycle-time. To this end, we propose the first Gaussian drying model to evaluate the local evaporation rate during the drying process. Specifically, we estimate the drying time depending on the number, area, and distribution of the objects in a given layer. Finally, we minimize the total drying time by assigning the to-be-printed objects to different layers with mixed-integer linear programming (MILP) methods. Experimental results demonstrate that our Gaussian drying model closely approximates the actual drying process. In particular, comparing the non-optimized fabrication to the optimized results shows that our method reduces the drying time by 39%.
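The layer-assignment objective can be pictured with a toy model (hypothetical drying times and conflict pairs; the paper uses a Gaussian drying model and MILP rather than this brute force): each layer dries for as long as its slowest object, and objects whose ink would merge must land in different layers.

```python
from itertools import product

def total_drying_time(assignment, dry_time, conflicts):
    """Sum over layers of the slowest object's drying time; an assignment
    that co-prints two conflicting (ink-merging) objects is invalid."""
    if any(assignment[a] == assignment[b] for a, b in conflicts):
        return float("inf")
    layers = {}
    for obj, layer in enumerate(assignment):
        layers.setdefault(layer, []).append(dry_time[obj])
    return sum(max(times) for times in layers.values())

def best_assignment(dry_time, conflicts, n_layers):
    """Exhaustively try every object-to-layer assignment (toy scale only)."""
    return min(product(range(n_layers), repeat=len(dry_time)),
               key=lambda a: total_drying_time(a, dry_time, conflicts))

dry = [10, 9, 2, 1]            # hypothetical per-object drying times
conflicts = [(2, 3)]           # objects 2 and 3 would merge if co-printed
a = best_assignment(dry, conflicts, 2)
print(total_drying_time(a, dry, conflicts))   # 11: {0,1,2} dry together, {3} alone
```

Grouping the two slow objects (10 and 9) in one layer pays only one long drying wait, which is the intuition behind optimizing per-layer drying time rather than just layer count.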
Title: Optimized Data Reuse via Reordering for Sparse Matrix-Vector Multiplication on FPGAs
Authors: Shiqing Li, Di Liu, Weichen Liu
DOI: 10.1109/ICCAD51958.2021.9643453
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: Sparse matrix-vector multiplication (SpMV) is of paramount importance in both scientific and engineering applications. The main workload of SpMV is multiplications between randomly distributed nonzero elements of sparse matrices and their corresponding vector elements. Due to the irregular data access patterns of vector elements and limited memory bandwidth, the computational throughput of CPUs and GPUs falls short of the peak performance offered by FPGAs. An FPGA's large on-chip memory allows the input vector to be buffered on chip, so off-chip memory bandwidth is used only to transfer the nonzero elements' values, column indices, and row indices. Multiple nonzero elements are transmitted to the FPGA, and their corresponding vector elements are accessed, each cycle. However, typical on-chip block RAMs (BRAMs) in FPGAs have only two access ports. The mismatch between off-chip memory bandwidth and on-chip memory ports stalls the whole engine, resulting in inefficient utilization of off-chip memory bandwidth. In this work, we reorder the nonzero elements to optimize data reuse for SpMV on FPGAs. The key observation is that since vector elements can be reused by nonzero elements with the same column index, memory requests for these elements can be omitted by reusing the fetched data. Based on this observation, we propose a novel compressed format that optimizes data reuse by reordering the matrix's nonzero elements. Further, to support the compressed format, we design a scalable hardware accelerator and implement it on the Xilinx UltraScale ZCU106 platform. We evaluate the proposed design with a set of matrices from the University of Florida sparse matrix collection. The experimental results show that the proposed design achieves an average 1.22x speedup over the state-of-the-art work.
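The reuse argument can be quantified with a simplified fetch model (hypothetical COO data; the paper's actual compressed format and accelerator are more involved): if the vector element fetched for one column can serve the next nonzero with the same column index, grouping nonzeros by column removes the redundant fetches.

```python
def vector_fetches(nonzeros):
    """Count vector-element fetches, assuming only the most recently
    fetched element is held for reuse (a single reuse register)."""
    fetches, last_col = 0, None
    for row, col, val in nonzeros:        # compute step: y[row] += val * x[col]
        if col != last_col:
            fetches += 1
            last_col = col
    return fetches

coo = [(0, 3, 1.0), (1, 0, 2.0), (1, 3, 3.0), (2, 0, 4.0)]
print(vector_fetches(coo))                                # 4: every access misses
print(vector_fetches(sorted(coo, key=lambda e: e[1])))    # 2: columns grouped
```

Because each nonzero carries its own row index, reordering does not change the result of the multiplication; it only changes the order in which partial sums accumulate.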
Title: 2021 ICCAD CAD Contest Problem B: Routing with Cell Movement Advanced: Invited Paper
Authors: Kai-Shun Hu, Tao-Chun Yu, Ming Yang, Cindy Chin-Fang Shen
DOI: 10.1109/ICCAD51958.2021.9643568
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: 2021 ICCAD CAD Contest Problem B extends 2020 ICCAD CAD Contest Problem B [1]–[2] to address more complex constraints. In physical implementation, the common approach divides the problem into placement and routing stages. This divide-and-conquer approach, however, can cause conservative margin reservation and miscorrelation. To achieve multiple advanced objectives in terms of Power, timing Performance, and Area (PPA), a certain amount of cell movement at the routing stage becomes a desired functionality in an EDA tool. 2021 ICCAD CAD Contest Problem B encourages research into routing with cell movement to achieve multiple objectives in advanced process nodes (less than 7 nm). We provide (i) a set of benchmarks and (ii) an evaluation metric covering multiple objectives, including the power factor, the criticality of timing-critical nets, the maximum number of moved cells, and total routing length, to help contestants develop and test new algorithms.
Title: Analytical Modeling of Transient Electromigration Stress based on Boundary Reflections
Authors: Mohammad Abdullah Al Shohel, Vidya A. Chhabria, N. Evmorfopoulos, S. Sapatnekar
DOI: 10.1109/ICCAD51958.2021.9643570
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: Traditional methods that test for electromigration (EM) failure in multisegment interconnects over the lifespan of an IC are based on the Blech criterion, followed by Black's equation. Such methods analyze each segment independently, but are well known to be inaccurate due to stress buildup across multiple segments. This paper introduces the new concept of boundary reflections of stress flow, which ascribes a physical (wave-like) interpretation to the transient stress behavior in a finite multisegment line. This provides a framework for deriving analytical expressions for transient EM stress in lines with any number of segments, and the expressions can be tailored to include the appropriate number of terms for any desired level of accuracy. The proposed method shows excellent accuracy in evaluations against the FEM solver COMSOL, as well as scalability in its application to large power grid benchmarks.
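For reference, the two classical ingredients the abstract mentions take the following standard textbook forms (background statements, not results of this paper): Black's equation for the mean time to failure under current density j (with fitted constant A, current exponent n, activation energy E_a), and the Blech criterion, under which a segment of length L is immortal when the current-density-length product stays below a critical value set by the critical stress σ_crit, atomic volume Ω, effective charge number Z*, electron charge e, and resistivity ρ:

```latex
\mathrm{MTTF} = A\, j^{-n} \exp\!\left(\frac{E_a}{kT}\right),
\qquad
j \cdot L \;<\; (jL)_{\mathrm{crit}} = \frac{\Omega\,\sigma_{\mathrm{crit}}}{e\,Z^{*}\rho}
```

Both treat a segment in isolation, which is precisely the per-segment independence assumption the paper's boundary-reflection model removes.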
Title: Generalizable Cross-Graph Embedding for GNN-based Congestion Prediction
Authors: Amur Ghose, Vincent Zhang, Yingxue Zhang, Dong Li, Wulong Liu, M. Coates
DOI: 10.1109/ICCAD51958.2021.9643446
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: With technology node scaling, an accurate prediction model at early design stages can significantly shorten the design cycle. In particular, during logic synthesis, predicting cell congestion caused by improper logic combinations can reduce the burden of subsequent physical implementation. There have been attempts to use Graph Neural Network (GNN) techniques to tackle congestion prediction at the logic synthesis stage. However, they require informative cell features to achieve reasonable performance, since the core idea of GNNs is built on the message-passing framework, and such features are impractical to obtain at the early logic synthesis stage. To address this limitation, we propose a framework that directly learns embeddings for a given netlist to enhance the quality of our node features. Popular random-walk-based embedding methods such as Node2vec, LINE, and DeepWalk suffer from cross-graph alignment issues and poor generalization to unseen netlist graphs, yielding inferior performance at significant runtime cost. In our framework, we introduce a superior alternative that obtains node embeddings generalizing across netlist graphs using matrix factorization methods. We propose an efficient mini-batch training method at the sub-graph level that guarantees parallel training and satisfies the memory restrictions of large-scale netlists. We present results using open-source EDA tools such as the DREAMPLACE and OPENROAD frameworks on a variety of openly available circuits. By combining the learned embeddings of the netlist with GNNs, our method improves prediction performance, generalizes to new circuit lines, and is efficient in training, potentially saving over 90% of runtime.
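A minimal sketch of the matrix-factorization idea (a plain truncated SVD on a toy graph; the paper's factorization objective, alignment strategy, and netlist features are its own): factor the adjacency matrix and keep the leading singular directions as node embeddings, so structurally similar nodes land near each other.

```python
import numpy as np

def embed_nodes(adj, dim):
    """Spectral node embedding: SVD-factor the adjacency matrix (with
    self-loops added) and scale the top singular vectors by sqrt(s)."""
    a = adj + np.eye(adj.shape[0])    # self-loops avoid sign degeneracies
    u, s, _ = np.linalg.svd(a)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy "netlist": two 3-cell cliques joined by a single bridging net.
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
emb = embed_nodes(adj, 2)
# Cells in the same clique land closer together than cells across the bridge.
```

In contrast to random-walk methods, the factorization is deterministic for a given matrix, which is one reason factorization-style embeddings are easier to keep consistent across graphs.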
Title: TopoPart: a Multi-level Topology-Driven Partitioning Framework for Multi-FPGA Systems
Authors: Dan Zheng, Xinshi Zang, Martin D. F. Wong
DOI: 10.1109/ICCAD51958.2021.9643481
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: As the complexity of circuit designs continues to grow, multi-FPGA systems are becoming increasingly popular for logic emulation and rapid prototyping. In a multi-FPGA system, the FPGAs are connected by limited physical wires; in other words, one FPGA usually has direct connections to only a few others. During the circuit partitioning stage, assigning two directly connected nodes to two FPGAs without a physical link would significantly increase delay and degrade overall performance. However, well-known partitioners such as hMETIS and PaToH focus mainly on cut-size minimization without considering such topology constraints of FPGAs, which limits their practical usage. In this paper, we propose a multi-level topology-driven partitioning framework, named TopoPart, to deal with topology constraints in multi-FPGA systems. In particular, we first devise a candidate FPGA propagation algorithm in the coarsening phase to guarantee that the later stages are free of topology violations. In the final refinement phase, cut size is iteratively optimized while maintaining both topology and resource constraints. Compared with the proposed baseline, our partitioning algorithm achieves zero topology violations while yielding a smaller cut size.
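The topology constraint itself is easy to state as a checker (a toy illustration with hypothetical link and net lists, not TopoPart's propagation algorithm): a netlist edge violates the topology when its two endpoints are assigned to FPGAs that have no direct physical connection.

```python
def topology_violations(fpga_links, netlist_edges, assignment):
    """Count netlist edges whose endpoints are mapped to two different
    FPGAs that lack a direct physical link."""
    linked = {frozenset(l) for l in fpga_links}
    bad = 0
    for u, v in netlist_edges:
        fu, fv = assignment[u], assignment[v]
        if fu != fv and frozenset((fu, fv)) not in linked:
            bad += 1
    return bad

# 4 FPGAs wired in a ring 0-1-2-3-0; FPGAs 0 and 2 are NOT directly wired.
links = [(0, 1), (1, 2), (2, 3), (3, 0)]
edges = [("a", "b"), ("b", "c")]
print(topology_violations(links, edges, {"a": 0, "b": 2, "c": 2}))  # 1
print(topology_violations(links, edges, {"a": 1, "b": 2, "c": 2}))  # 0
```

A cut-size-only partitioner minimizes the number of edges crossing FPGAs, but this counter can still be nonzero for its output, which is exactly the gap the topology-driven framework closes.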
Title: Evolving Complementary Sparsity Patterns for Hardware-Friendly Inference of Sparse DNNs
Authors: Elbruz Ozen, A. Orailoglu
DOI: 10.1109/ICCAD51958.2021.9643452
Published: 2021-11-01, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)
Abstract: Sparse deep learning models are known to be more accurate than their dense counterparts for equal parameter and computational budgets. Unstructured model pruning can deliver dramatic compression rates, yet the resulting irregular sparsity patterns pose severe computational challenges for modern hardware. Our work introduces a set of complementary sparsity patterns to construct sparse neural network layers that are both highly expressive and inherently regular. We propose a novel training approach to evolve inherently regular sparsity configurations and transform the expressive power of the proposed layers into competitive classification accuracy even under extreme sparsity constraints. The structure of the introduced sparsity patterns enables optimal compression of the layer parameters into a dense representation. Moreover, the constructed layers can be processed in the compressed format with full hardware utilization on minimally modified non-sparse computational hardware. The experimental results demonstrate superior compression rates and remarkable performance improvements for sparse neural network inference on systolic arrays.
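One way to picture how complementary patterns yield a dense representation (an illustrative sketch with two hypothetical checkerboard masks; the paper's patterns and training procedure are more elaborate): masks that are disjoint and jointly cover every position let several sparse layers share a single dense array, and each layer is recovered losslessly by re-applying its own mask.

```python
import numpy as np

def pack(weights, masks):
    """Pack sparse weight matrices with complementary 0/1 masks (disjoint,
    jointly covering every position) into one dense matrix."""
    assert np.array_equal(sum(masks), np.ones_like(masks[0]))
    return sum(w * m for w, m in zip(weights, masks))

rng = np.random.default_rng(0)
m0 = (np.indices((4, 4)).sum(axis=0) % 2 == 0).astype(float)  # checkerboard
m1 = 1.0 - m0                                                  # its complement
w0, w1 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
dense = pack([w0, w1], [m0, m1])
# Each 50%-sparse layer unpacks exactly from the shared dense matrix:
assert np.array_equal(dense * m0, w0 * m0)
assert np.array_equal(dense * m1, w1 * m1)
```

Because the packed matrix has no zeros to skip, a plain dense engine such as a systolic array processes it at full utilization, which is the hardware-friendliness the abstract describes.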
Pub Date : 2021-11-01DOI: 10.1109/ICCAD51958.2021.9643459
Hsiao-Yin Tseng, I. Chiu, Mu-Ting Wu, C. Li
The demand for neuromorphic chips has skyrocketed in recent years. Thus, efficient manufacturing testing becomes an issue. Conventional testing cannot be applied because some neuromorphic chips do not have scan chains. However, traditional functional testing for neuromorphic chips suffers from long test length and low fault coverage. In this work, we propose a machine learning-based test pattern generation technique with behavior fault models. We use the concept of adversarial attack to generate test patterns to improve the fault coverage of existing functional test patterns. The effectiveness of the proposed technique is demonstrated on two Spiking Neural Network models trained on MNIST. Compared to traditional functional testing, our proposed technique reduces test length by 566x to 8,824x and improves fault coverage by 8.1% to 86.3% on five fault models. Finally, we propose a methodology to solve the scalability issue for the synapse fault models, resulting in 25.7x run time reduction on test pattern generation for synapse faults.
{"title":"Machine Learning-Based Test Pattern Generation for Neuromorphic Chips","authors":"Hsiao-Yin Tseng, I. Chiu, Mu-Ting Wu, C. Li","doi":"10.1109/ICCAD51958.2021.9643459","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643459","url":null,"abstract":"The demand for neuromorphic chips has skyrocketed in recent years. Thus, efficient manufacturing testing becomes an issue. Conventional testing cannot be applied because some neuromorphic chips do not have scan chains. However, traditional functional testing for neuromorphic chips suffers from long test length and low fault coverage. In this work, we propose a machine learning-based test pattern generation technique with behavior fault models. We use the concept of adversarial attack to generate test patterns to improve the fault coverage of existing functional test patterns. The effectiveness of the proposed technique is demonstrated on two Spiking Neural Network models trained on MNIST. Compared to traditional functional testing, our proposed technique reduces test length by 566x to 8,824x and improves fault coverage by 8.1% to 86.3% on five fault models. Finally, we propose a methodology to solve the scalability issue for the synapse fault models, resulting in 25.7x run time reduction on test pattern generation for synapse faults.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125741484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
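The core move in the record above is borrowing adversarial-attack machinery for test generation: perturb an existing functional input in the direction that maximizes the behavioral gap between the fault-free circuit and a faulty one. A toy sketch of that idea, with a single neuron, a stuck-at-0 synapse fault, and a numerical gradient standing in for the paper's actual models (all of these are our simplifications):

```python
import numpy as np

# Toy sketch: FGSM-style step that makes a test pattern more sensitive to
# an injected behavior-level fault. Not the paper's exact fault models.
def neuron(x, w, faulty=False):
    w = w.copy()
    if faulty:
        w[0] = 0.0  # synapse fault: one weight stuck at 0
    return np.tanh(w @ x)

def fault_gap(x, w):
    # Observable difference between fault-free and faulty behavior.
    return abs(neuron(x, w) - neuron(x, w, faulty=True))

rng = np.random.default_rng(1)
w = rng.standard_normal(4)
x = rng.standard_normal(4)  # an existing functional test pattern

# Numerical gradient of the fault-sensitivity objective w.r.t. the input.
eps, step = 1e-5, 0.5
grad = np.array([(fault_gap(x + eps * e, w) - fault_gap(x, w)) / eps
                 for e in np.eye(4)])

# Sign-gradient step; keep the original pattern if the step overshoots.
x_adv = x + step * np.sign(grad)
if fault_gap(x_adv, w) < fault_gap(x, w):
    x_adv = x
assert fault_gap(x_adv, w) >= fault_gap(x, w)
```

A pattern with a larger `fault_gap` is more likely to propagate the fault to an observable output, which is why gradient-guided perturbation can raise coverage without lengthening the test set.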
Pub Date: 2021-11-01, DOI: 10.1109/ICCAD51958.2021.9643454
Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, Jingtong Hu
Deep learning models have been deployed in an increasing number of edge and mobile devices to provide healthcare. These models rely on training with a tremendous amount of labeled data to achieve high accuracy. However, for medical applications such as dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on distributed mobile devices of patients, and each device only has a limited amount of data. Directly learning from limited data greatly deteriorates the performance of learned models. Federated learning (FL) can train models by using data distributed on devices while keeping the data local for privacy. Existing works on FL assume all the data have ground-truth labels. However, medical data often comes without any accompanying labels since labeling requires expertise and results in prohibitively high labor costs. The recently developed self-supervised learning approach, contrastive learning (CL), can leverage the unlabeled data to pre-train a model for learning data representations, after which the learned model can be fine-tuned on limited labeled data to perform dermatological disease diagnosis. However, simply combining CL with FL as federated contrastive learning (FCL) will result in ineffective learning since CL requires diverse data for accurate learning but each device in FL only has limited data diversity. In this work, we propose an on-device FCL framework for dermatological disease diagnosis with limited labels. Features are shared among devices in the FCL pre-training process to provide diverse and accurate contrastive information without sharing raw data for privacy. After that, the pre-trained model is fine-tuned with local labeled data independently on each device or collaboratively with supervised federated learning on all devices. Experiments on dermatological disease datasets show that the proposed framework effectively improves the recall and precision of dermatological disease diagnosis compared with state-of-the-art methods.
{"title":"Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning (Invited Paper)","authors":"Yawen Wu, Dewen Zeng, Zhepeng Wang, Yi Sheng, Lei Yang, Alaina J. James, Yiyu Shi, Jingtong Hu","doi":"10.1109/ICCAD51958.2021.9643454","DOIUrl":"https://doi.org/10.1109/ICCAD51958.2021.9643454","url":null,"abstract":"Deep learning models have been deployed in an increasing number of edge and mobile devices to provide healthcare. These models rely on training with a tremendous amount of labeled data to achieve high accuracy. However, for medical applications such as dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on distributed mobile devices of patients, and each device only has a limited amount of data. Directly learning from limited data greatly deteriorates the performance of learned models. Federated learning (FL) can train models by using data distributed on devices while keeping the data local for privacy. Existing works on FL assume all the data have ground-truth labels. However, medical data often comes without any accompanying labels since labeling requires expertise and results in prohibitively high labor costs. The recently developed self-supervised learning approach, contrastive learning (CL), can leverage the unlabeled data to pre-train a model for learning data representations, after which the learned model can be fine-tuned on limited labeled data to perform dermatological disease diagnosis. However, simply combining CL with FL as federated contrastive learning (FCL) will result in ineffective learning since CL requires diverse data for accurate learning but each device in FL only has limited data diversity. In this work, we propose an on-device FCL framework for dermatological disease diagnosis with limited labels. Features are shared among devices in the FCL pre-training process to provide diverse and accurate contrastive information without sharing raw data for privacy. After that, the pre-trained model is fine-tuned with local labeled data independently on each device or collaboratively with supervised federated learning on all devices. Experiments on dermatological disease datasets show that the proposed framework effectively improves the recall and precision of dermatological disease diagnosis compared with state-of-the-art methods.","PeriodicalId":370791,"journal":{"name":"2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121208993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
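The abstract's key mechanism is sharing learned features (never raw images) so that each client's contrastive loss sees diverse negatives despite limited local data. A minimal NumPy sketch of why shared features enrich the objective, using an InfoNCE-style loss; the dimensions, temperature, and variable names are illustrative, not the paper's API:

```python
import numpy as np

# Sketch: a client contrasts a local anchor/positive pair against negative
# features shared by peer devices, keeping the peers' raw data private.
def l2norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def contrastive_loss(anchor, positive, negatives, tau=0.5):
    # InfoNCE: pull the positive close, push the shared negatives away.
    sims = np.concatenate(([anchor @ positive], negatives @ anchor)) / tau
    sims -= sims.max()  # numerical stability before the softmax
    p = np.exp(sims) / np.exp(sims).sum()
    return -np.log(p[0])

rng = np.random.default_rng(0)
anchor = l2norm(rng.standard_normal(8))
positive = l2norm(anchor + 0.1 * rng.standard_normal(8))  # augmented view
remote = l2norm(rng.standard_normal((16, 8)))  # features shared by peers

local_only = contrastive_loss(anchor, positive, remote[:2])   # few local negatives
with_shared = contrastive_loss(anchor, positive, remote)      # peers' features added
# Extra shared negatives enlarge the softmax denominator, so the
# objective is provably at least as hard as the local-only one.
assert with_shared >= local_only
```

After this pre-training phase, the abstract's pipeline fine-tunes the encoder on each device's small labeled set, either independently or via supervised federated learning.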