Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105382
F. Firouzi, S. Kiamehr, M. Tahoori
Power supply noise is one of the major design concerns in nano-scale VLSI. Because of the switching current of logic gates, the actual supply voltage seen by different devices fluctuates, causing extra delay and ultimately intermittent faults during operation. Accurate estimation of the worst-case scenario, that is, the maximum noise and the input vectors that cause it, is therefore extremely important for the design, verification, and manufacturing test steps. In this paper we present a mixed-integer linear programming (MILP) model of power supply noise in digital circuits that yields fast and accurate solutions. Compared with accurate SPICE simulations of random vectors on a set of benchmark circuits, the proposed approach achieves a 13115× speedup while obtaining results that are 2.7% better on average.
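To make the worst-case vector problem concrete, the following is a minimal sketch that exhaustively searches input vector pairs of a toy three-gate netlist for the pair that toggles the most current. The netlist, the per-gate currents, and the use of total switching current as a proxy for noise are all assumptions made for illustration; the paper's MILP formulation is not reproduced here, and brute-force enumeration only works for tiny circuits.

```python
from itertools import product

# Hypothetical peak switching current (mA) drawn when each gate output toggles.
GATE_CURRENT_MA = {"g1": 3, "g2": 2, "g3": 4}


def evaluate(a, b, c):
    """Toy netlist (an assumption): g1 = a AND b, g2 = b OR c, g3 = g1 XOR g2."""
    g1 = a & b
    g2 = b | c
    g3 = g1 ^ g2
    return {"g1": g1, "g2": g2, "g3": g3}


def switching_current(v1, v2):
    """Sum the current of every gate whose output differs between v1 and v2."""
    before, after = evaluate(*v1), evaluate(*v2)
    return sum(i for g, i in GATE_CURRENT_MA.items() if before[g] != after[g])


def worst_case_pair():
    """Exhaustively search all ordered vector pairs for the maximum current."""
    vectors = list(product((0, 1), repeat=3))
    return max(
        ((switching_current(v1, v2), v1, v2) for v1 in vectors for v2 in vectors),
        key=lambda t: t[0],
    )
```

Calling `worst_case_pair()` returns the maximum current together with one vector pair that produces it. An MILP model would replace the enumeration with binary variables for the gate values and linear constraints for the gate functions, which is what lets a solver handle circuits far beyond the reach of exhaustive search.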
Title: Modeling and estimation of power supply noise using linear programming. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 537-542.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105420
E. Kilada, K. Stevens
Latency-insensitive (LI) designs can tolerate arbitrary computation and communication latencies. Synchronous elasticization converts an ordinary clocked design into an LI design using communication protocols such as the Synchronous Elastic Flow (SELF). Compared to its lazy implementations, eager SELF has no combinational cycles and can provide a performance advantage. However, it uses eager forks (EForks), which consume more area and power. This paper demonstrates that EForks can be redundant. A novel ultra simple fork (USFork) implementation is introduced, and the conditions under which an EFork behaves exactly like a USFork (from the protocol perspective) are formally derived. The paper also investigates the conditions under which multiple SELF controllers can be merged to further decrease the area and power overhead (as long as the physical placement allows). The flow has been integrated into a fully automated tool, the Hybrid GENerator (HGEN), which selectively replaces redundant EForks with USForks and, optionally, merges equivalent controllers. HGEN uses the 6thSense tool as an embedded verification engine. Compared to the methodology used in published work on a MiniMIPS processor case study, HGEN shows up to 34.3% and 25.4% savings in area and power, respectively, due to utilizing USForks. It also shows at least a 32% saving in the number of EForks in the s382 ISCAS benchmark. Further reduction is possible if the physical placement allows for controller merging. Thanks to advances in synchronous verification technology, HGEN runs within a few minutes for all examples in this paper, which makes the proposed approach suitable for tight time-to-market constraints.
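As a rough illustration of why an eager fork needs extra state, the sketch below models one under a simplified valid/stop handshake. The class name, the single-token-per-cycle model, and the signal names are assumptions for illustration only; this is not the SELF specification, and the USFork is not modelled because the abstract does not describe its implementation.

```python
class EagerFork:
    """Behavioural model of an eager fork under a simplified valid/stop handshake."""

    def __init__(self, branches):
        # pending[i] is True while branch i still has to receive the current token.
        self.pending = [True] * branches

    def step(self, valid_in, stops):
        """Advance one clock cycle.

        valid_in: the input channel offers a token.
        stops:    per-branch back-pressure (True means the branch cannot accept).
        Returns (valid_outs, stop_in).
        """
        # Offer the token only to branches that have not received it yet.
        valid_outs = [valid_in and p for p in self.pending]
        # A branch is served when it is offered the token and is not stopped.
        still_pending = [p and (s or not valid_in)
                         for p, s in zip(self.pending, stops)]
        # The input is held (stopped) until every branch has been served.
        stop_in = valid_in and any(still_pending)
        if valid_in and not stop_in:
            self.pending = [True] * len(self.pending)  # token consumed, re-arm
        else:
            self.pending = still_pending
        return valid_outs, stop_in
```

The per-branch `pending` bits are the state that lets ready branches proceed without waiting for stalled ones. In this simplified model, that state is never exercised when all branches always stall together, which gives an intuition for how a fork could be redundant, although the paper's formal conditions are not derived here.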
Title: Synchronous elasticization at a reduced cost: Utilizing the ultra simple fork and controller merging. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 794-801.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105321
Wangyang Zhang, K. Balakrishnan, Xin Li, D. Boning, Rob A. Rutenbar
In this paper, we propose a new technique to accurately decompose process variation into two components: (1) spatially correlated variation and (2) uncorrelated random variation. Such variation decomposition is important for identifying systematic variation patterns at the wafer and/or chip level for process modeling, control, and diagnosis. We demonstrate that spatially correlated variation carries a unique sparse signature in the frequency domain. Based on this observation, an efficient sparse regression algorithm is applied to accurately separate spatially correlated variation from uncorrelated random variation. An important contribution of this paper is a fast numerical algorithm that reduces the computational time of sparse regression by several orders of magnitude over the traditional implementation. Our experimental results based on silicon measurement data demonstrate that the proposed sparse regression technique can capture spatially correlated variation patterns with high accuracy. The estimation error is reduced by more than 3.5× compared to other traditional methods.
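The idea of a sparse frequency-domain signature can be sketched in one dimension as follows. This is only an illustration under stated assumptions: it uses an orthonormal DCT-II basis and keeps the largest coefficients (which, for a complete orthonormal basis, is the best k-sparse least-squares fit), whereas the paper's actual transform, regression solver, and fast algorithm are not specified in the abstract. The synthetic data and all names are hypothetical.

```python
import math
import random


def dct_basis(k, n, size):
    """Orthonormal DCT-II basis function k evaluated at sample n."""
    scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
    return scale * math.cos(math.pi * (n + 0.5) * k / size)


def decompose(samples, sparsity):
    """Split samples into a k-sparse DCT part and an uncorrelated residual."""
    size = len(samples)
    coeffs = [sum(x * dct_basis(k, n, size) for n, x in enumerate(samples))
              for k in range(size)]
    # Keep the `sparsity` largest-magnitude coefficients as the support.
    ranked = sorted(range(size), key=lambda k: -abs(coeffs[k]))
    support = sorted(ranked[:sparsity])
    correlated = [sum(coeffs[k] * dct_basis(k, n, size) for k in support)
                  for n in range(size)]
    residual = [x - c for x, c in zip(samples, correlated)]
    return support, correlated, residual


# Synthetic "measurements": one smooth spatial pattern plus random noise.
rng = random.Random(0)
SIZE = 32
smooth = [2.0 * math.cos(math.pi * (n + 0.5) * 3 / SIZE) for n in range(SIZE)]
measured = [s + rng.gauss(0.0, 0.1) for s in smooth]
support, correlated, residual = decompose(measured, sparsity=1)
```

Here `correlated` approximates the smooth pattern and `residual` holds what is left as random variation. Real wafer or chip data would be two-dimensional and irregularly sampled, which is where a genuine sparse regression solver is needed instead of simple coefficient selection.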
Title: Toward efficient spatial variation decomposition via sparse regression. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 162-169.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105305
Hui Zhao, Akbar Sharifi, Shekhar Srikantaiah, M. Kandemir
Focusing on data reliability, we propose a control-theory-centric approach designed to improve transient error resilience in the shared caches of emerging multicores while satisfying performance goals. The proposed scheme takes two quality of service (QoS) specifications as input: a performance QoS and a reliability QoS. The first indicates the minimum acceptable workload-wide cache (L2) hit rate, whereas the second captures the reliability bound on a per-application basis, using a metric called Reads-with-Replica (RwR). We present an extensive experimental evaluation of the proposed scheme on various workloads formed from applications in the SPEC2006 benchmark suite. In most of the tested cases, the proposed scheme satisfies both the performance and reliability QoS targets by modulating the total size of the data replication area and the partitioning of this area among the co-runner applications. The collected results also show that our scheme achieves consistent improvements under different values of the major simulation parameters.
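The feedback loop described above can be sketched with a deliberately simple step controller. Everything below is a toy under stated assumptions: the two plant functions (how hit rate and RwR respond to the replication area) are invented linear models, the controller moves one cache way per epoch, and per-application partitioning is omitted. The paper's actual controller design is not given in the abstract.

```python
def hit_rate(replica_ways):
    """Toy plant (assumption): each way given to replicas costs 2 points of hit rate."""
    return 0.90 - 0.02 * replica_ways


def reads_with_replica(replica_ways):
    """Toy plant (assumption): fraction of reads with a replica, saturating at 8 ways."""
    return min(1.0, replica_ways / 8)


def control(perf_target, rel_target, total_ways=16, epochs=20):
    """Adjust the replication area by one way per epoch until both targets hold."""
    ways = 0
    history = []
    for _ in range(epochs):
        if hit_rate(ways) < perf_target and ways > 0:
            ways -= 1  # performance QoS violated: shrink the replication area
        elif reads_with_replica(ways) < rel_target and ways < total_ways:
            ways += 1  # reliability QoS violated: grow the replication area
        history.append(ways)
    return ways, history
```

With a performance target of 0.80 and a reliability target of 0.5, `control(0.80, 0.5)` grows the replication area until both targets hold and then stays there. Performance takes priority in this sketch, so when the two targets conflict the loop can oscillate around the performance limit rather than settle.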
Title: Feedback control based cache reliability enhancement for emerging multicores. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 56-62.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105324
J. Cong, Peng Zhang, Yi Zou
External memory bandwidth is a crucial bottleneck for both performance and power consumption in the majority of computation-intensive applications. Data reuse is an important technique for reducing external memory accesses by utilizing the memory hierarchy. Loop transformation for data locality and memory hierarchy allocation are the two major steps in a data reuse optimization flow, but they have so far been carried out independently. This paper presents a combined approach that optimizes loop transformation and memory hierarchy allocation simultaneously to achieve globally optimal results for external memory bandwidth and on-chip data reuse buffer size. We develop an efficient and optimal solution to the combined problem by decomposing the solution space into two subspaces, with linear and nonlinear constraints respectively. We show that the solution space can be pruned significantly without losing optimality. Experimental results show that our scheme can save up to 31% of on-chip memory size compared to the separate two-step method when the memory hierarchy allocation problem is not trivial. The run-time complexity is also acceptable for practical cases.
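The trade-off between external traffic and on-chip buffer size can be illustrated on a tiled matrix multiplication. The sketch below jointly picks a tile size (a loop transformation choice) under a buffer budget (an allocation constraint). The cost formulas are the usual estimate for square tiling with one resident tile each of A, B, and C, and are assumptions for illustration; this is not the paper's algorithm or its solution-space decomposition.

```python
def tiled_matmul_cost(n, tile):
    """External traffic and buffer size (both in words) for T x T tiling of
    C[i][j] += A[i][k] * B[k][j], keeping one tile of A, B and C on chip."""
    traffic = 2 * n**3 // tile + n**2  # A and B tiles reloaded, C written once
    buffer_words = 3 * tile**2
    return traffic, buffer_words


def best_tiling(n, buffer_budget):
    """Pick the tile size that minimises traffic subject to the buffer budget."""
    best = None
    for tile in (t for t in range(1, n + 1) if n % t == 0):
        traffic, buffer_words = tiled_matmul_cost(n, tile)
        if buffer_words <= buffer_budget and (best is None or traffic < best[1]):
            best = (tile, traffic, buffer_words)
    return best
```

For example, `best_tiling(64, 200)` selects the largest tile whose three buffers fit in 200 words. Evaluating the loop choice and the buffer choice together is the point: a tile size chosen only for locality may not fit the buffer, and a buffer sized in isolation may leave traffic savings unused.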
Title: Combined loop transformation and hierarchy allocation for data reuse optimization. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 185-192.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105385
M. Pathak, Jiwoo Pak, D. Pan, S. Lim
Electromigration (EM) is a critical problem for the interconnect reliability of modern integrated circuits (ICs), especially as feature sizes shrink. In three-dimensional (3D) IC technology, the EM problem becomes more severe due to drastic dimension mismatches between metal wires, through-silicon vias (TSVs), and landing pads. Meanwhile, the thermo-mechanical stress caused by TSVs can also reduce the failure time of wires. However, there has been very little study of EM issues that considers TSVs in 3D ICs. In this paper, we show the impact of TSV stress on the EM failure time of metal wires in 3D ICs. We model the impact of TSVs on stress variation in wires and then perform detailed modeling of the impact of stress on the EM failure time of metal wires. Based on our analysis, we build a detailed library to predict the failure time of a given wire from its current density, temperature, and stress. We then propose a method for fast full-chip simulation to determine the EM-related hot-spots in the design. We also propose a simple routing-blockage scheme to reduce EM-related failures near the TSVs and evaluate its impact on various metrics.
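For orientation, the dependence of EM failure time on current density and temperature is commonly summarized by Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). The sketch below implements only that classical baseline. The prefactor, exponent, and activation energy defaults are placeholder values chosen for illustration, and the TSV stress term that the paper adds is not included because the abstract does not give its form.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def black_mttf(current_density, temperature_k, prefactor=1.0,
               exponent=2.0, activation_energy_ev=0.9):
    """Black's equation: MTTF = A * J**(-n) * exp(Ea / (k * T)).

    The default A, n and Ea are illustrative placeholders, not fitted values.
    """
    return (prefactor * current_density ** (-exponent)
            * math.exp(activation_energy_ev
                       / (BOLTZMANN_EV_PER_K * temperature_k)))
```

With an exponent of 2, doubling the current density cuts the predicted lifetime by a factor of four, and raising the temperature shortens it exponentially. A stress-aware library like the one described above would add stress as a third lookup dimension alongside these two.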
Title: Electromigration modeling and full-chip reliability analysis for BEOL interconnect in TSV-based 3D ICs. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 555-562.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105403
M. Velev, Ping Gao
We present highly automatic techniques for the formal verification of pipelined microprocessors with hardware support for multithreading. The processors are modeled at a high level of abstraction, using a subset of Verilog, in a way that allows us to exploit the property of Positive Equality, which significantly simplifies the solution space and yields orders-of-magnitude speedup relative to previous methods. We propose abstraction techniques that produce at least three orders of magnitude speedup, and the speedup increases with the number of threads implemented in a pipelined processor. To the best of our knowledge, this is the first work on automatic formal verification of pipelined processors with hardware support for multithreading.
Title: Automatic formal verification of multithreaded pipelined microprocessors. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 679-686.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105398
Hsuan-Ming Chou, Hao Yu, Shih-Chieh Chang
Rather than being minimized, clock skew can be exploited to improve circuit performance. However, it is difficult to apply useful skew to a design with complicated power modes. With only one clock tree, useful skew in one power mode may be harmful in another. In this paper, we propose using adjustable delay buffers (ADBs) to construct a tunable clock tree so that useful skew can be assigned for different power modes. Assuming the positions of the ADBs are determined, we assign the ADB delays for each power mode by linear programming (LP). A speedup theorem is then proposed to greatly reduce the number of LP inequalities. We also propose an efficient method to select the positions of the ADBs. Our experimental results show that the number of inequalities is reduced by 99.45% on average and that an average performance improvement of 27.35% is obtained compared with the commercial tool SOC Encounter™.
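The constraints behind useful skew can be sketched for a single power mode. Setup and hold requirements on the clock arrival times are difference constraints, so instead of a general LP solver this sketch checks feasibility with Bellman-Ford and binary-searches the clock period. The three-flip-flop ring, its delays, and the zero setup and hold times are hypothetical; ADB placement, discrete ADB delay steps, multiple power modes, and the paper's speedup theorem are not modelled.

```python
def feasible(num_ffs, paths, period, hold=0.0):
    """Check by Bellman-Ford whether clock arrival times s[] exist for `period`.

    paths: (src, dst, dmin, dmax) combinational paths between flip-flops.
    Setup: s[src] + dmax <= s[dst] + period
    Hold:  s[src] + dmin >= s[dst] + hold
    """
    edges = []  # (u, v, w) encodes the difference constraint s[v] - s[u] <= w
    for src, dst, dmin, dmax in paths:
        edges.append((dst, src, period - dmax))  # setup constraint
        edges.append((src, dst, dmin - hold))    # hold constraint
    dist = [0.0] * num_ffs  # same as a virtual source with 0-weight edges
    for _ in range(num_ffs):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True  # converged: dist[] is a valid skew assignment
    return False  # still relaxing: negative cycle, constraints infeasible


def min_period(num_ffs, paths, iterations=60):
    """Binary-search the smallest feasible period (zero skew bounds it above)."""
    lo, hi = 0.0, max(p[3] for p in paths)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if feasible(num_ffs, paths, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical ring of three flip-flops: (src, dst, dmin, dmax).
RING = [(0, 1, 2.0, 8.0), (1, 2, 2.0, 4.0), (2, 0, 2.0, 6.0)]
```

With zero skew the ring needs a period equal to its longest path delay, while `min_period(3, RING)` finds a shorter period by letting slack from the short paths be borrowed by the long one. In a multi-mode design the path delays differ per mode, so the same kind of problem is solved once per mode with ADB delays as the tunable arrival times.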
Title: Useful-skew clock optimization for multi-power mode designs. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 647-650.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105306
R. Topaloglu, Benedict R. Gaster
Graphics processing unit (GPU) computing has been an active area of research in the last few years. While the initial adopters of the technology came from the image processing domain because GPUs were difficult to program, research on programming languages has made it possible for people without knowledge of low-level programming languages such as OpenGL to develop code on GPUs. Two main GPU architectures, from AMD (formerly ATI) and NVIDIA, gained ground. AMD adapted Stanford's Brook language and made it into an architecture-agnostic programming model. NVIDIA, on the other hand, brought the CUDA framework to a wide audience. While the two languages have their pros and cons, such as Brook not scaling as well and CUDA having to account for architecture-level decisions, it has not been possible to compile code written for one architecture on the other or across platforms. Another opportunity came with the idea of combining one or more CPUs and GPUs on the same die. By eliminating some of the interconnect bandwidth issues, this combination makes it possible to offload highly parallel tasks to the GPU. The technological direction toward multicores for CPU-only architectures also requires a change in programming methodology and acts as a catalyst for suitable programming languages. Hence, a unified language that can be used on multicore CPUs, on GPUs, and on their combinations has gained interest. Open Computing Language (OpenCL), originally developed by Apple, standardized by the Khronos Group, and supported by both AMD and NVIDIA, is seen as the programming language of choice for parallel programming. In this paper, we provide the motivation for our tutorial talk on the use of OpenCL for GPUs and highlight key features of the language. We provide research directions on OpenCL for EDA. In our tutorial talk, we use EDA as our application domain to get readers started with programming in OpenCL, the rising language of parallelism.
Title: GPU programming for EDA with OpenCL. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 63-66.
Pub Date: 2011-11-07, DOI: 10.1109/ICCAD.2011.6105342
I. Markov, Dongjin Lee
This mini-tutorial covers recent research on clock-network tuning. It starts with SPICE-accurate optimizations used in winning entries at the ISPD 2009 and 2010 clock-network synthesis contests. After comparing clock trees to meshes, it outlines a recent redundant clock-network topology that retains most advantages of clock trees, but improves robustness to PVT variations. It also shows how to incorporate clock-network synthesis into global placement to reduce dynamic power and insertion delay.
Title: Algorithmic tuning of clock trees and derived non-tree structures. Venue: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 279-282.