Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105394
Haris Javaid, M. Shafique, J. Henkel, S. Parameswaran
System-level dynamic power management (DPM) schemes in Multiprocessor Systems-on-Chip (MPSoCs) exploit the idleness of processors to reduce energy consumption by putting idle processors into low-power states. In the presence of multiple low-power states, the challenge is to predict the duration of the idle period with high accuracy so that the most beneficial power state can be selected for the idle processor. In this work, we propose a novel dynamic power management scheme for adaptive pipelined MPSoCs, suitable for multimedia applications. We leverage application knowledge in the form of future workload prediction to forecast the duration of idle periods. The predicted duration is then used to select an appropriate power state for the idle processor. We propose five heuristics as part of the DPM scheme and compare their effectiveness using an MPSoC implementation of the H.264 video encoder supporting HD720p at 30 fps. The results show that one of the application-prediction-based heuristics (MAMAPBH) predicted the most beneficial power states for idle processors with less than 3% error when compared to an optimal solution. In terms of energy savings, MAMAPBH was always within 1% of the energy savings of the optimal solution. When compared with a naive approach (where only one of the possible power states is used for all the idle processors), MAMAPBH achieved up to 40% more energy savings with only 0.5% degradation in throughput. These results signify the importance of leveraging application knowledge at the system level for dynamic power management schemes.
Title: System-level application-aware dynamic power management in adaptive pipelined MPSoCs for multimedia. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 616-623.
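The selection step described in the abstract above (pick a power state from a predicted idle duration) can be sketched as follows. This is only a minimal illustration, not the MAMAPBH heuristic itself: the state names, power figures, transition energies and wake-up latencies are hypothetical, and the rule used here (lowest total energy among states whose wake-up latency fits in the predicted idle period) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    name: str
    power_mw: float        # power drawn while resident in the state
    transition_mj: float   # energy spent entering and leaving the state
    wakeup_ms: float       # latency to return to the active state


# Hypothetical states and numbers, chosen only for illustration.
STATES = [
    PowerState("idle", 100.0, 0.0, 0.0),
    PowerState("clock_gated", 40.0, 0.5, 1.0),
    PowerState("power_gated", 5.0, 3.0, 5.0),
]


def energy_mj(state: PowerState, idle_ms: float) -> float:
    """Total energy for spending idle_ms in the state (mW * ms = uJ)."""
    return state.power_mw * idle_ms / 1000.0 + state.transition_mj


def select_state(predicted_idle_ms: float, states=STATES) -> PowerState:
    """Pick the cheapest state whose wake-up latency fits the idle period."""
    feasible = [s for s in states if s.wakeup_ms <= predicted_idle_ms]
    return min(feasible, key=lambda s: energy_mj(s, predicted_idle_ms))


if __name__ == "__main__":
    for idle in (2.0, 20.0, 100.0):
        print(idle, "ms ->", select_state(idle).name)
```

With these assumed numbers, short idle periods stay in the shallow state because the transition energy of deeper states is not recovered, which is why an accurate idle-duration prediction matters.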
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105316
M. S. Haque, Jorgen Peddersen, S. Parameswaran
An application's cache miss rate is used in timing analysis, system performance prediction and in deciding the best cache memory for an embedded system to meet tighter constraints. Single-pass simulation allows a designer to find the number of cache misses quickly and accurately on various cache memories. Such single-pass simulation systems have previously relied heavily on cache inclusion properties, which allowed rapid simulation of cache configurations for different applications. Thus far the only inclusion properties discovered were applicable to caches based on the Least Recently Used (LRU) replacement policy. However, LRU-based caches are rarely implemented in real life due to their circuit complexity at larger cache associativities. Embedded processors typically use a FIFO replacement policy in their caches instead, for which there are no full inclusion properties to exploit. In this paper, for the first time, we introduce a cache property called the “Intersection Property” that helps to reduce single-pass simulation time in a manner similar to the inclusion property. An intersection property defines conditions that, if met, prove a particular element exists in larger caches, thus avoiding further search time. We discuss three such intersection properties for caches using the FIFO replacement policy in this paper. A rapid single-pass FIFO cache simulator, “CIPARSim”, is also proposed. CIPARSim is the first single-pass simulator that relies on FIFO cache properties to reduce simulation time significantly. CIPARSim was up to 5 times faster (on average 3 times faster) than the state-of-the-art single-pass FIFO cache simulator for the cache configurations tested. CIPARSim produces the cache hit and miss rates of an application accurately on various cache configurations. During simulation, CIPARSim's intersection properties alone predict up to 90% (on average 65%) of the total hits, reducing simulation time immensely.
Title: CIPARSim: Cache intersection property assisted rapid single-pass FIFO cache simulation technique. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 126-133.
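For readers unfamiliar with FIFO replacement, the sketch below simulates a single set-associative FIFO cache configuration and counts hits and misses. It is a plain per-configuration baseline, not CIPARSim and not a single-pass technique; the function name, parameters and the small address trace are hypothetical.

```python
from collections import deque


def simulate_fifo(trace, num_sets, assoc, block_size):
    """Count (hits, misses) for one FIFO cache configuration."""
    sets = [deque() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        block = addr // block_size
        index = block % num_sets
        tag = block // num_sets
        ways = sets[index]
        if tag in ways:
            # Under FIFO a hit does not change the replacement order.
            hits += 1
        else:
            misses += 1
            if len(ways) == assoc:
                ways.popleft()  # evict the block that was inserted first
            ways.append(tag)
    return hits, misses


if __name__ == "__main__":
    trace = [0, 1, 0, 2, 0]  # hypothetical block addresses
    print(simulate_fifo(trace, num_sets=1, assoc=2, block_size=1))
    print(simulate_fifo(trace, num_sets=1, assoc=3, block_size=1))
```

In the two-way case the final access to block 0 misses, because block 0 was the oldest insertion when block 2 arrived even though it had just been reused; an LRU cache would have kept it. This difference in behaviour is why properties derived for LRU caches do not carry over directly to FIFO caches.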
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105383
Xueqian Zhao, Jia Wang, Zhuo Feng, Shiyan Hu
It is increasingly challenging to analyze present-day large-scale power delivery networks (PDNs) due to the drastically growing complexity of power grid design. To achieve greater runtime and memory efficiency, a variety of preconditioned iterative algorithms have been investigated in the past few decades with promising performance, while incremental power grid analysis has also become popular for fast re-simulation of corrected designs. Although existing preconditioned solvers, such as those using incomplete matrix factor-based preconditioners, usually exhibit high efficiency in memory usage, their convergence behavior is not always satisfactory. In this work, we present a novel hierarchical support-graph preconditioned iterative algorithm that constructs preconditioners by generating spanning trees in power supply networks for fast power grid analysis. The support-graph preconditioner is efficient for handling complex power grid structures (regular or irregular grids), and can facilitate very fast incremental analysis. Our experimental results on IBM power grid benchmarks show that, compared with the best direct or iterative solvers, the proposed support-graph preconditioned iterative solver achieves up to 3.6X speedups for DC analysis, and up to 22X speedups for incremental analysis, while reducing the memory consumption by a factor of four.
Title: Power grid analysis with hierarchical support graphs. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 543-547.
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105347
M. Grange, A. Jantsch, R. Weerasekera, D. Pamunuwa
Hierarchical models from the physical to the system level are proposed for architectural exploration of high-performance silicon systems to quantify the performance and cost trade-offs for 2-D and 3-D IC implementations. We show that 3-D systems can reduce interconnect delay and energy by up to an order of magnitude over 2-D, with an increase of 20–30% in performance-per-watt for every doubling of stack height. Contrary to previous analysis, the improved energy efficiency is achievable at a favorable cost. The models are packaged as a standalone tool and can provide fast estimation of coarse-grain performance and cost limitations for a variety of processing systems to be used at the early chip-planning phase of the design cycle.
Title: Modeling the computational efficiency of 2-D and 3-D silicon processors for early-chip planning. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 310-317.
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105401
Rangharajan Venkatesan, A. Agarwal, K. Roy, A. Raghunathan
Approximate computing, which refers to a class of techniques that relax the requirement of exact equivalence between the specification and implementation of a computing system, has attracted significant interest in recent years. We propose a systematic methodology, called MACACO, for the Modeling and Analysis of Circuits for Approximate Computing. The proposed methodology can be utilized to analyze how an approximate circuit behaves with reference to a conventional correct implementation, by computing metrics such as worst-case error, average-case error, error probability, and error distribution. The methodology applies to both timing-induced approximations such as voltage over-scaling or over-clocking, and functional approximations based on logic complexity reduction. The first step in MACACO is the construction of an equivalent untimed circuit that represents the behavior of the approximate circuit at a given voltage and clock period. Next, we construct a virtual error circuit that represents the error in the approximate circuit's output for any given input or input sequence. Finally, we apply conventional Boolean analysis techniques (SAT solvers, BDDs) and statistical techniques (Monte-Carlo simulation) in order to compute the various metrics of interest. We have applied the proposed methodology to analyze a range of approximate designs for datapath building blocks. Our results show that MACACO can help a designer to systematically evaluate the impact of approximate circuits, and to choose between different approximate implementations, thereby facilitating the adoption of such circuits for approximate computing.
Title: MACACO: Modeling and analysis of circuits for approximate computing. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 667-673.
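The error metrics named in the abstract above (worst-case error, average-case error, error probability) can be illustrated on a toy functional approximation. The sketch below is not the MACACO flow: it replaces SAT/BDD analysis with exhaustive enumeration, and the approximate adder (lower k bits OR-ed instead of added, with no carry into the upper part) is a hypothetical example chosen for simplicity.

```python
from itertools import product


def approx_add(a: int, b: int, k: int = 2) -> int:
    """Approximate adder: OR the lower k bits, add the upper bits exactly."""
    mask = (1 << k) - 1
    low = (a | b) & mask                 # no carry is generated here
    high = ((a >> k) + (b >> k)) << k    # exact addition of the upper part
    return high | low


def error_metrics(width: int = 4, k: int = 2) -> dict:
    """Compare against exact addition over all pairs of width-bit inputs."""
    errors = [
        abs((a + b) - approx_add(a, b, k))
        for a, b in product(range(1 << width), repeat=2)
    ]
    n = len(errors)
    return {
        "worst_case": max(errors),
        "average": sum(errors) / n,
        "error_probability": sum(e != 0 for e in errors) / n,
    }


if __name__ == "__main__":
    print(error_metrics(width=4, k=2))
```

Exhaustive enumeration only scales to small bit widths, which is one motivation for the Boolean and statistical analysis techniques the abstract mentions.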
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105313
A. Rogachev, Lu Wan, Deming Chen
With technology scaling, the variability of device parameters continues to increase. This impacts both the performance and the temperature profile of the die, turning them into statistical distributions. To the best of our knowledge, no one has considered the impact of the statistical thermal profile during statistical analysis of the propagation delay. We present a statistical static timing analysis (SSTA) tool which considers this interdependence and produces accurate timing estimation. Our average errors for mean and standard deviation are 0.95% and 3.5% respectively when compared against Monte Carlo simulation. This is a significant improvement over SSTA that assumes a deterministic power profile, whose mean and standard deviation errors are 3.7% and 20.9% respectively. However, when considering >90% performance yield, our algorithm's accuracy improvement was not as significant when compared to the deterministic power case. Thus, if one is concerned with the runtime, a reasonable estimate of the performance yield can be obtained by assuming nominal power. Nevertheless, a full statistical analysis is necessary to achieve maximum accuracy.
Title: Temperature aware statistical static timing analysis. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 103-110.
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105417
Mac Y. C. Kao, Kun-Ting Tsai, Shih-Chieh Chang
Clock skew minimization is important in the VLSI design field. Due to the presence of Process, Voltage, and Temperature (PVT) variations, the Post-Silicon Skew Tuning (PST) technique, which can tolerate PVT variations, has attracted broad discussion. A PST architecture can dynamically minimize the clock skew even after a chip is manufactured. However, testing the variation tolerance ability of a PST architecture is very difficult because the clock skew does not directly affect the functionality of a design. In addition, creating PVT variation in the traditional testing environment is not easy. Unlike most previous works, which focus on the implementation and performance issues of a PST architecture, the objective of this paper is to propose efficient test mechanisms and verify the variation tolerance ability. In addition, we propose a novel structure to increase the robustness of a PST architecture in case of a manufacturing fault. Our experiment shows that robustness can be achieved with little overhead.
Title: A robust architecture for post-silicon skew tuning. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 774-778.
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105396
Dongjin Lee, I. Markov
Recent improvements in clock-tree and mesh-based topologies maintain a healthy competition between the two. Trees require much smaller capacitance, but meshes are naturally robust against process variation and can accommodate late design changes. Cross-link insertion has been advocated to make trees more robust, but is limited in practice to short distances. In this work we develop a novel non-tree topology that fuses several clock trees to create large-scale redundancy in a clock network. Empirical validation shows that our novel clock-network structure incrementally enhances robustness to satisfy given variation constraints. Our implementation called Contango3.0 produces robust clock networks even for challenging skew limits, without parallel buffering used by other implementations. It also offers a fine trade-off between power and robustness, increasing the capacitance of the initial tree by less than 60%, which results in 2.3× greater power efficiency than mesh structures.
Title: Multilevel tree fusion for robust clock networks. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 632-639.
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105397
Seungwhun Paik, Gi-Joon Nam, Youngsoo Shin
A pulsed-latch can be modeled as a fast flip-flop. This allows conventional flip-flop designs to be migrated to pulsed-latch versions by simple replacement to reduce the clocking power. A key step in the migration process is to insert pulsers, which generate clock pulses to drive local latches; the number of pulsers as well as the wirelength of clock routing must be minimized to reduce the clocking power. We formulate a pulser insertion problem to find a set of latch groups where each group shares a pulser and its load constraint is satisfied; both an ILP formulation and a heuristic algorithm are presented to solve the problem. Experimental results for circuits implemented in 32-nm CMOS technology show that the clocking power of pulsed-latch designs obtained by our approach is 5.9% less than that of a greedy approach; this is 44.7% less than that of flip-flop designs. We also consider the problem of pulsed-registers, where a pulser is integrated with multiple latches. A concept of logical distance is explored in our clustering algorithm to minimize the overhead of signal wirelength when converting flip-flops to pulsed-registers. Compared with flip-flop circuits, signal wirelength is increased by 6.3%, which is 1.4% smaller than without considering logical distance, while reducing the clocking power by 24%.
Title: Implementation of pulsed-latch and pulsed-register circuits to minimize clocking power. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 640-646.
Pub Date : 2011-11-07 DOI: 10.1109/ICCAD.2011.6105350
J. Krichmar, N. Dutt, J. Nageswaran, Micah Richert
Biological neural systems are well known for their robust and power-efficient operation in highly noisy environments. We outline key modeling abstractions for the brain and focus on spiking neural network models. We discuss aspects of neuronal processing and computational issues related to modeling these processes. Although many of these algorithms can be efficiently realized in specialized hardware, we present a case study of simulation of the visual cortex using a GPU based simulation environment that is readily usable by neuroscientists and computer scientists and efficient enough to construct very large networks comparable to brain networks.
Title: Neuromorphic modeling abstractions and simulation of large-scale cortical networks. Published in: 2011 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 334-338.