Among the different methods of reducing power in core-based system-on-chip (SoC) designs, the voltage island technique has gained popularity. Assigning cores to supply voltages and floorplanning to create contiguous voltage islands are the two key steps in the design process. We propose a new application-driven, floorplan-aware approach to voltage partitioning and island creation with the objective of reducing overall SoC power, area and runtime. Previous approaches used the voltage assignment table as the starting point for voltage island creation. In this paper, we present a technique to generate a voltage assignment table using dynamic programming. Next, we partition the cores into islands based on the Power State Model (PSM) of the application and the connectivity information used in floorplanning. Finally, solutions are sent to the floorplanner in sequence until a valid solution is reached. Compared to previously reported techniques, our approach achieves a 10% reduction in power and an 8% reduction in area, with an average runtime improvement of 2.3X.
{"title":"Application-driven floorplan-aware voltage island design","authors":"D. Sengupta, R. Saleh","doi":"10.1145/1391469.1391511","DOIUrl":"https://doi.org/10.1145/1391469.1391511","url":null,"abstract":"Among the different methods of reducing power for core-based system-on-chip (SoC) designs, the voltage island technique has gained in popularity. Assigning cores to the different supply voltages and floorplanning to create contiguous voltage islands are the two important steps in the design process. We propose a new application-driven, floorplan-aware approach to voltage partitioning and island creation with the objective of reducing overall SoC power, area and runtime. Previous approaches used the voltage assignment table as the starting point for voltage island creation. In this paper, we present a technique to generate a voltage assignment table using dynamic programming. Next, we partition the cores into islands, based on the Power State Model (PSM) of the application, and connectivity information used in floorplanning. Finally, solutions are sent to the floorplanner in sequence until a valid solution is reached. Compared to previously reported techniques, a 10% reduction in power and 8% reduction in area are achieved using our approach, with an average runtime improvement of 2.3X.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117111127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Y. Alkabani, T. Massey, F. Koushanfar, M. Potkonjak
We present the first approach to post-silicon leakage power reduction through input vector control (IVC) that takes into account the impact of manufacturing variability (MV). Because of MV, the integrated circuits (ICs) implementing one design require different input vectors to achieve their lowest-leakage states. We address two major challenges. The first is the extraction of the gate-level characteristics of an IC by measuring only the overall leakage power for different inputs. The second is the rapid generation of input vectors that result in low leakage for a large number of unique ICs that implement a given design but differ after manufacturing. Experimental results on a large set of benchmark instances demonstrate the efficiency of the proposed methods. For example, leakage power consumption is reduced on average by more than 10.4% compared to previously published IVC techniques that did not consider MV.
{"title":"Input vector control for post-silicon leakage current minimization in the presence of manufacturing variability","authors":"Y. Alkabani, T. Massey, F. Koushanfar, M. Potkonjak","doi":"10.1145/1391469.1391624","DOIUrl":"https://doi.org/10.1145/1391469.1391624","url":null,"abstract":"We present the first approach for post-silicon leakage power reduction through input vector control (IVC) that takes into account the impact of the manufacturing variability (MV). Because of the MV, the integrated circuits (ICs) implementing one design require different input vectors to achieve their lowest leakage states. We address two major challenges. The first is the extraction of the gate- level characteristics of an IC by measuring only the overall leakage power for different inputs. The second problem is the rapid generation of input vectors that result in a low leakage for a large number of unique ICs that implement a given design, but are different in the post-manufacturing phase. Experimental results on a large set of benchmark instances demonstrate the efficiency of the proposed methods. For example, the leakage power consumption could be reduced in average by more than 10.4%, when compared to the previously published IVC techniques that did not consider MV.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116328536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In analog layout design, it is very important to reduce parasitic coupling effects and improve circuit performance. Consequently, the most important device-level placement constraints are matching, symmetry, and proximity. However, many previous works handle these constraints separately, and none of them address how to handle different constraints simultaneously and hierarchically. In this paper, we first give a case study to show the need to integrate these constraints in a hierarchical manner. Then, we present the first formulation for analog placement based on hierarchical module clustering. Our approach can handle analog placement with various constraint groups, including matching, (hierarchical) symmetry, and (hierarchical) proximity groups. To the best of our knowledge, this is also the first work in the literature to handle floorplanning with the clustering constraint using the B*-tree representation. Experimental results based on industrial analog designs show that our approach is very effective and efficient.
{"title":"Analog placement based on hierarchical module clustering","authors":"Mark Po-Hung Lin, Shyh-Chang Lin","doi":"10.1145/1391469.1391484","DOIUrl":"https://doi.org/10.1145/1391469.1391484","url":null,"abstract":"In analog layout design, it is very important to reduce the parasitic coupling effects and improve the circuit performance. Consequently, the most important device-level placement constraints are matching, symmetry, and proximity. However, many previous works deal with these constraints separately, and none of them mention how to handle different constraints simultaneously and hierarchically. In this paper, we first give a case study to show the needs of integrating these constraints in a hierarchical manner. Then, we present the first formulation for analog placement based on hierarchical module clustering. Our approach can handle analog placement with various constraint groups including matching, (hierarchical) symmetry, and (hierarchical) proximity groups. To our best knowledge, this is also the first work in the literature to handle floorplanning with the clustering constraint using the B*-tree based representation. Experimental results based on industrial analog designs show that our approach is very effective and efficient.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128905000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Woo-Cheol Kwon, S. Yoo, Sungpack Hong, Byeong Min, Kyu-Myung Choi, Soo-Kwan Eo
3D stacked memory enables more off-chip DDR memories. Redesigning existing IPs to exploit the increased memory parallelism would be prohibitively costly. In our work, we propose a practical approach to exploit the increased bandwidth and reduced latency of multiple off-chip DDR memories while reusing existing IPs without modification. The proposed approach is based on two new concepts: transaction ID renaming and distributed soft arbitration. We present two on-chip network components, a request parallelizer and a read data serializer, to realize these concepts. Experiments with synthetic test cases and an industrial-strength DTV SoC design show that the proposed approach gives significant improvements in total execution cycles (21.6%) and average memory access latency (31.6%) in the DTV case, with a small area overhead (30.1% relative to the on-chip network and less than 1.4% relative to the entire chip).
{"title":"A practical approach of memory access parallelization to exploit multiple off-chip DDR memories","authors":"Woo-Cheol Kwon, S. Yoo, Sungpack Hong, Byeong Min, Kyu-Myung Choi, Soo-Kwan Eo","doi":"10.1145/1391469.1391585","DOIUrl":"https://doi.org/10.1145/1391469.1391585","url":null,"abstract":"3D stacked memory enables more off-chip DDR memories. Redesigning existing IPs to exploit the increased memory parallelism will be prohibitively costly. In our work, we propose a practical approach to exploit the increased bandwidth and reduced latency of multiple off-chip DDR memories while reusing existing IPs without modification. The proposed approach is based on two new concepts: transaction id renaming and distributed soft arbitration. We present two on-chip network components, request parallelizer and read data serializer, to realize the concepts. Experiments with synthetic test cases and an industrial strength DTV SoC design show that the proposed approach gives significant improvements in total execution cycle (21.6%) and average memory access latency (31.6%) in the DTV case with a small area overhead (30.1% in the on-chip network, and less than 1.4% in the entire chip).","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123145760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern microprocessors are becoming increasingly parallel devices, and GPUs are at the leading edge of this trend. Designing parallel algorithms for manycore chips like the GPU can present interesting challenges, particularly for computations on sparse data structures. One particularly common example is the collection of sparse matrix solvers and combinatorial graph algorithms that form the core of many physical simulation techniques. Although seemingly irregular, these operations can often be implemented with data parallel operations that map very well to massively parallel processors.
{"title":"Sparse matrix computations on manycore GPU’s","authors":"M. Garland","doi":"10.1145/1391469.1391473","DOIUrl":"https://doi.org/10.1145/1391469.1391473","url":null,"abstract":"Modern microprocessors are becoming increasingly parallel devices, and GPUs are at the leading edge of this trend. Designing parallel algorithms for manycore chips like the GPU can present interesting challenges, particularly for computations on sparse data structures. One particularly common example is the collection of sparse matrix solvers and combinatorial graph algorithms that form the core of many physical simulation techniques. Although seemingly irregular, these operations can often be implemented with data parallel operations that map very well to massively parallel processors.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121592186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi Wang, W. Luk, Xuan Zeng, Jun Tao, Changhao Yan, J. Tong, W. Cai, Jia Ni
In nanometer technologies, process variations have a growing nonlinear impact on circuit performance, causing the critical path delays of combinational circuits to vary randomly with non-Gaussian distributions. In this paper, we propose a novel clock skew scheduling methodology that optimizes timing yield by handling non-Gaussian distributions of critical path delays. First, a general formulation of the optimization problem is proposed; it covers most previous formulations and exposes their limitations with statistical interpretations. Then, a generalized minimum balancing algorithm is proposed to solve the skew scheduling problem effectively. Experimental results show that the proposed method significantly outperforms representative methods previously proposed for yield optimization, obtaining timing yield improvements of up to 33.6%, and 17.7% on average.
{"title":"Timing yield driven clock skew scheduling considering non-Gaussian distributions of critical path delays","authors":"Yi Wang, W. Luk, Xuan Zeng, Jun Tao, Changhao Yan, J. Tong, W. Cai, Jia Ni","doi":"10.1145/1391469.1391525","DOIUrl":"https://doi.org/10.1145/1391469.1391525","url":null,"abstract":"In nanometer technologies, process variations possess growing nonlinear impacts on circuit performance, which causes critical path delays of combinatorial circuits variate randomly with non-Gaussian distribution. In this paper, we propose a novel clock skew scheduling methodology that optimizes timing yield by handling non-Gaussian distributions of critical path delays. Firstly a general formulation of the optimization problem is proposed, which covers most of the previous formulations and indicates their limitations with statistical interpretations. Then a generalized minimum balancing algorithm is proposed for effectively solving the skew scheduling problem. Experimental results show that the proposed method significantly outperforms some representative methods previously proposed for yield optimization, and could obtain timing yield improvements up to 33.6% and averagely 17.7%.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126260156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Handheld mobile terminals have developed from simple phones into devices featuring a wide variety of modern multimedia functions; they are, in effect, multimedia computers. In today's mobile terminals, computational demand is close to that of personal desktop computers from only a few years ago. All these new features require more power and bandwidth in the interconnections, and innovations must be brought to these devices at an ever-increasing pace.
{"title":"Standard interfaces in mobile terminals — increasing the efficiency of device design and accelerating innovation","authors":"R. Savolainen, T. Rissa","doi":"10.1145/1391469.1391619","DOIUrl":"https://doi.org/10.1145/1391469.1391619","url":null,"abstract":"Handheld mobile terminals have developed from simple phones to devices featuring a wide variety of modern multimedia functions, being in fact multimedia computers. In today's mobile terminals, computational demand is closes to that of personal desktop computers only a few years ago. All these new features need more power and bandwidth in interconnections. New innovations must be implemented in these devices with ever increasing speed.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114078647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Future CPU directions increasingly emphasize parallel compute platforms, which depend critically upon greater core-to-core communication and stress the overall memory and storage interconnect hierarchy far more than extrapolations of past platform needs would suggest. Performance is critically dependent upon memory bandwidth and latency, but must be balanced against power and cost considerations. 3D stacking of CPUs and memory (i.e., as a last-level cache) is a potential solution that provides the necessary bandwidth within a reasonable power envelope.
{"title":"Tera-scale computing and interconnect challenges","authors":"J. Bautista","doi":"10.1145/1391469.1391641","DOIUrl":"https://doi.org/10.1145/1391469.1391641","url":null,"abstract":"Future CPU directions are increasingly emphasizing parallel compute platforms which are critically dependent upon upon greater core to core communication as well as generally stressing the overall memory and storage interconnect hierarchy to a much greater degree than extrapolations of past platform needs. Performance is critically dependent upon memory bandwidth and latency but must be moderated with power and cost considerations. 3D stacking of CPU's and memory (i.e. a last level cache) is a potential solution that provides the necessary bandwidth within a reasonable power envelope.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131542980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei Huang, M. Stan, K. Sankaranarayanan, R. J. Ribando, K. Skadron
Air cooling limits have been a major design challenge in recent years for integrated circuits. Multi-core exacerbates thermal challenges because power scales with the number of cores, but also creates new opportunities for temperature-aware design, because multi-core designs offer more design parameters than single-core designs. This paper investigates the relationship between core size and on-chip hot spot temperature and shows that with the same power density, smaller cores are cooler than larger cores due to a spatial low-pass filtering effect of temperature. This phenomenon suggests that designs exploiting low-pass filtering can dissipate more power within the same cooling budget than contemporary designs.
{"title":"Many-core design from a thermal perspective","authors":"Wei Huang, M. Stan, K. Sankaranarayanan, R. J. Ribando, K. Skadron","doi":"10.1145/1391469.1391660","DOIUrl":"https://doi.org/10.1145/1391469.1391660","url":null,"abstract":"Air cooling limits have been a major design challenge in recent years for integrated circuits. Multi-core exacerbates thermal challenges because power scales with the number of cores, but also creates new opportunities for temperature-aware design, because multi-core designs offer more design parameters than single-core designs. This paper investigates the relationship between core size and on-chip hot spot temperature and shows that with the same power density, smaller cores are cooler than larger cores due to a spatial low-pass filtering effect of temperature. This phenomenon suggests that designs exploiting low-pass filtering can dissipate more power within the same cooling budget than contemporary designs.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115148180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Today, embedded processors are expected to be able to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate. This paper explores a new data cache design for use in modern high-performance embedded processors that will dynamically improve execution time, power efficiency, and determinism within the system. The simulation results show significant improvement in cache miss ratios and reduction in power consumption of approximately 30% and 15%, respectively.
{"title":"Miss reduction in embedded processors through dynamic, power-friendly cache design","authors":"Garo Bournoutian, A. Orailoglu","doi":"10.1145/1391469.1391546","DOIUrl":"https://doi.org/10.1145/1391469.1391546","url":null,"abstract":"Today, embedded processors are expected to be able to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate. This paper explores a new data cache design for use in modern high-performance embedded processors that will dynamically improve execution time, power efficiency, and determinism within the system. The simulation results show significant improvement in cache miss ratios and reduction in power consumption of approximately 30% and 15%, respectively.","PeriodicalId":412696,"journal":{"name":"2008 45th ACM/IEEE Design Automation Conference","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134186213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}